Season 2, Episode 4
Hyperparameter and Neural Architecture Search
Liam Li is a leading researcher in the fields of hyperparameter optimization and neural architecture search, and is the author of the seminal Hyperband paper. In this session, Liam discusses the evolution of hyperparameter optimization techniques and illustrates how every data scientist can benefit from neural architecture search.
Liam Li recently completed his PhD in Machine Learning from Carnegie Mellon University, where he was advised by Ameet Talwalkar. His thesis on efficient methods for automating machine learning showcases his work on Hyperband, large-scale hyperparameter tuning, and efficient neural architecture search. Since then, he joined Determined AI as a machine learning engineer to build a cutting-edge platform for deep learning, enabling users to be vastly more productive and happier. He continues to be involved in the research and AutoML community and is a co-chair for the 2nd ICLR workshop on Neural Architecture Search.
Welcome to Data Brew by Databricks with Denny and Brooke. This series allows us to explore various topics in the data and AI community. Whether we’re talking about data engineering or data science, we’ll interview subject matter experts to dive deeper into those topics, and while we’re at it, please do enjoy your morning brew. My name is Denny Lee and I’m a developer advocate at Databricks.
Hello everyone. My name is Brooke Wenig. I’m the other co-host of this show, and I’m the machine learning practice lead at Databricks. And today, I have the honor to invite my longtime friend, Liam Li, to the show. Liam and I were grad students together at UCLA, and Liam taught me so many things about machine learning. I will never forget the linear algebra lessons late nights in the lab. I would love for you to introduce yourself.
Yeah. Thanks for having me on the show guys. So a little background about me. I recently completed my PhD in machine learning from CMU, where I worked with my advisor, Ameet Talwalkar on AutoML research in developing more efficient methods to help people find models for their specific tasks and problems faster and with less compute. That’s a little bit about myself. Since then, I’ve joined Determined AI as an applied machine learning engineer, and I continued to work on some of the same problems that I did during my PhD, except now with more of a product focus, so really building out user-friendly tools for machine learning scientists to just be a lot more productive than they would be if they were kind of spinning up their own tools to do distributed training or cloud management and so on. So, yeah. Great to be here and excited to talk to you guys today.
Awesome. So, taking a step back before your PhD, what got you into the field of machine learning? What excites you about this field that made you want to get a PhD focusing on it?
Yeah, I would say I’m kind of a late bloomer. I studied applied math in college and I didn’t really do that much coding until my first couple jobs in economic consulting and finance. At the time, I wasn’t that excited about finance, and what I realized was that, what I liked the most about my job was the data analysis and modeling components. So, that’s kind of what made me start looking into data science, machine learning as a career. What I realized was all the roles that I wanted required at least a master’s, so at that point I was like, “Ah, I guess, I have to go back to school, get a degree in computer science or something, so that I’ll have kind of more of the skillset that I would need for the roles that I was interested in.”
So I applied to master’s programs in machine learning and computer science, and Ameet saw my application to UCLA, which is where he was at the time, and he asked me whether I wanted to do a PhD with him in machine learning. So that’s kind of how I got into machine learning and started on the PhD journey. Definitely not the traditional route, I would say, but I think it’s been a great experience. So I think anyone who’s interested in machine learning and wants to make the switch, I think it’s definitely doable. I myself made that switch from kind of more just a regular analyst position to machine learning scientist, so definitely go after it if that’s what you’re excited about. And yeah, I think the future for machine learning is only going to get brighter, more interesting, more exciting, so now is a good time to join.
Excellent. There’s a lot to unpack actually. For anybody that’s listening right now, we’re going to actually eventually talk about the Hyperband paper which is super popular. Liam, you talk about just the concept of machine learning in general, right? How did you get there because there’s a lot of steps to get there. First there’s hyperparameter tuning, then all of everything that goes on there that led you to the Hyperband paper. So I’m just curious, like why specifically this area? Can you tell a little, for the folks who are listening, a little bit about what are the problems around hyperparameter tuning in the first place.
Yeah. I think there are a lot of terms that get thrown around these days, artificial intelligence, machine learning, data science. At the end of the day, what we’re all trying to do is make sense of the data that we have so that we can use it to make predictions down the road, right? In terms of how we get to something that helps us make those predictions, usually that involves identifying a good model for my data and then using that model to generate the downstream predictions, right? So when you’re looking at the modeling problem, there are a lot of different tools you have at your disposal. There are a lot of different modeling types. There are a lot of… Or even going before the modeling process, you’re looking at things like how do I generate features or which features of the data are important and so on.
So there are just a lot of different techniques and tools, from feature creation to modeling to prediction, right? So all the different choices that you have are in some sense hyperparameters, right? So not only are the techniques and approaches themselves hyperparameters, but they also have their own kind of knobs that control how they behave. Then that introduces even more hyperparameters. So, really at the end of the day, it becomes this mix of a data scientist or machine learning engineer taking their domain knowledge about what approach has worked well for a particular problem and narrowing down the set of techniques they’re considering, and then using hyperparameter tuning approaches to fine-tune that set of techniques and model types, which have their own associated hyperparameters, so they can achieve the best performance on kind of the downstream prediction problem that they care about.
You can think of hyperparameter tuning as kind of a wrapper algorithm around this search process of finding what the best model and associated hyperparameters are for your problem. It’s just a way to automate that as much as possible, and to perform that search with as few computational resources as possible.
Right. So you’re trying to imply the fact that you didn’t want us to manually go ahead and do a checklist and go through each and every set of parameters yourself, right? So you’re trying to avoid that for all.
Right. Like if you’re training just a convolutional network, right? You have to think about what sort of learning rate to use, how do I configure my optimizer for gradient descent, what sort of regularization should I apply to the weights, should I use dropout, how much dropout? All these questions are kind of concerning the hyperparameters, right? And you’re trying to tune them so that you get the maximum predictive performance for the problem that you’re interested in.
Right. So then, that naturally leads us to like what’s the… I guess, currently popular or at least maybe popular two years ago is Bayesian optimization, so we would use those techniques as a method, as our way to try to figure out what the heck’s going on, or try to basically optimize our hyperparameters. Can you tell us a little bit then maybe what are the pros and the gotchas for using those types of techniques? And then I think that will naturally segue to Hyperband in that case.
As practitioners, the go-to techniques are just very simple kind of brute force methods, like random search and grid search. So here, you kind of have some strategy for deciding which hyperparameter settings you should try. So with grid search, you have evenly spaced points in your search space; with random search, you’re sampling randomly from some predefined distribution. If you look at these brute force methods, one natural question is, can we hope to do better by being smarter about which configurations, or hyperparameter settings, we want to evaluate, right? Can we use past experience, past information about how different hyperparameter settings performed before, to make an informed decision about what we should try next in hopes of maximizing the predictive performance, right? That’s what Bayesian optimization methods try to do. They have some internal model of what the performance, say validation performance, of different hyperparameters in the search space is, so there’s an underlying model that is being updated and maintained as you get more and more data about the performance of different hyperparameter settings, right?
So you evaluate a certain hyperparameter setting, get a metric of how well it performs, feed it into your Bayesian optimization model so that you can update what the surface looks like over the hyperparameter space, and then you can use that model to help guide the selection of new hyperparameter configurations that are more likely to do well, right? That’s kind of how Bayesian optimization approaches try to speed up the search for a good hyperparameter setting: by using that knowledge and having this modeled surface that helps guide the selection of better hyperparameters as you see more and more hyperparameters being evaluated.
So while Bayesian optimization allows you to leverage previous experiences with your different hyperparameter configurations, what about specifying the bounds? Like, do you have to tell it, these are the hyperparameters that I want to tune, or these are the allowable range of values that I want you to stay within. How do you incorporate that information?
Yeah. You basically specify it through the search space that you define, right? In the case of learning rate, you can say something like, “I only want to consider learning rates within a certain range,” and that is information that you feed to the function that is being fit over the search space, right? All that is done in the search space specification step, so even before you apply any of these algorithms, you have to say what your search space for the hyperparameters is, right? So for learning rate, you specify a range. For weight decay, you specify a range. For dropout, you specify a range. All of that is fed into the hyperparameter tuning algorithms. So if you’re using random search, you will sample from this constrained range that is predefined, right? The same thing with Bayesian optimization: the function or the model that’s being fit over the search space is also constrained by the same bounds.
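As a rough illustration of the search space specification step described above, here is a minimal Python sketch. The hyperparameter names, the bounds, and the choice of log-uniform sampling for learning rate and weight decay are assumptions made for this example, not values from the conversation.

```python
import math
import random

# Hypothetical search space: the names and bounds are illustrative only.
SEARCH_SPACE = {
    "learning_rate": ("log_uniform", 1e-5, 1e-1),
    "weight_decay": ("log_uniform", 1e-6, 1e-2),
    "dropout": ("uniform", 0.0, 0.5),
}

def sample_config(space, rng):
    """Draw one hyperparameter setting; every value stays inside its bounds."""
    config = {}
    for name, (kind, low, high) in space.items():
        if kind == "log_uniform":
            # Sample uniformly in log space so each order of magnitude is equally likely.
            value = math.exp(rng.uniform(math.log(low), math.log(high)))
        else:
            value = rng.uniform(low, high)
        # Clamp to guard against floating-point rounding at the edges.
        config[name] = min(max(value, low), high)
    return config

rng = random.Random(0)
configs = [sample_config(SEARCH_SPACE, rng) for _ in range(5)]
for config in configs:
    print(config)
```

Random search would simply evaluate each sampled setting; a model-based method would be given the same bounds and choose points inside them.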
So, going back to a structured grid search, one of the key drawbacks of it is that we can just specify a really stupid search space that we want to compute a bunch of models over. Do you ever foresee any libraries incorporating these best practices of like, if you’re going to build a random forest, don’t try to do depth 40, or don’t try to do a learning rate of 50. How do we ensure that when humans are designing the search space they use reasonable search values?
That’s a good question. I think if you’re looking at existing tools out there for AutoML, there are libraries like auto-sklearn that search over the most popular methods for either classification or regression from the scikit-learn library. That tool has kind of predefined ranges for the hyperparameters already, because that’s kind of the domain knowledge that a data scientist brings. So if you’re using a tool like that, that part is kind of done for you. Otherwise, it’s in some sense domain knowledge that you build through experience in manually building and training these models in the first place, right? I would say the search ranges should also be data specific, right? There isn’t a rule-of-thumb search space that’ll work well for every single data set, right?
I think there’s always going to be some element of trial and error where you spend a little bit of time finding what a reasonable range for the parameters is, and then you kind of search around that range of values. There’s still going to be some trial and error. I think looking at the default parameters for the models you’re considering in any of the libraries that you use is a good place to start. So if you look at scikit-learn defaults, if you look at MLlib defaults, right? So, sticking to those defaults as a starting point, and then kind of expanding the search space around the defaults is a good way to get started in identifying what your search space should be.
Got it. Well then I think that this is probably a good segue into… Now, how does the Hyperband approach differ from Bayesian optimization?
Yeah, so Hyperband is distinguished by its use of early stopping. If you go back to Bayesian optimization, what we’re trying to do is be adaptive in how we select hyperparameter settings to evaluate, so that’s where the adaptivity is. It’s in the selection. What Hyperband tries to do is… we are not being adaptive in how we select hyperparameters, but we are adaptive in how much training resource we allocate to different hyperparameter settings, right? So the actual hyperparameter settings that are being considered are still drawn randomly from a search space, but the algorithm is deciding which configurations to train further and which ones to stop training, right? That’s where the early stopping component comes in. So as a human, if you’re manually tuning a model, you might just try specific settings, and if they don’t seem to be working well, you’ll stop, switch around some values, and run again.
The early stopping paradigm for hyperparameter tuning is trying to replicate that same sort of logic, right? So if a particular setting is not doing well, then we shouldn’t allocate more resources to it and we should instead focus on training the hyperparameter settings that appear to be more promising. That’s the key idea behind Hyperband. I think what differentiates Hyperband from other early stopping approaches, like the commonly used median early stopping rule, is that it is theoretically grounded in analysis that we’ve done to show that if you use the Hyperband algorithm, you’re guaranteed to find a good configuration as long as you allocate enough resources to it. I think that gives people more comfort in using the algorithm because of the theoretical underpinnings, which I think might be a little bit more technical than we wanted to discuss here, but I do think it has helped with just people being more comfortable using the algorithm.
The algorithm itself is actually very simple. So Hyperband uses what’s called a multi-armed bandit algorithm, called successive halving, and it’s really simple. You start off with some set of hyperparameter configurations you’re considering. You allocate a very small amount of training resource to all of them. So this can be like training your model for one epoch for all the hyperparameter settings you’re considering and evaluating them after training one epoch, and then throwing away the worst-performing half of the hyperparameter configurations. Then in the next round, you just allocate more training resources to the remaining configurations. You do this until you’re left with, say, one hyperparameter setting that’s the best from the set that you started with. So the algorithm itself is very intuitive and very easy. I think a lot of the more challenging bits were in the actual theoretical analysis that was done to show that it’s provably correct.
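The successive halving loop described above can be sketched in a few lines of Python. This is a minimal sketch, not the Hyperband reference implementation: doubling the resource each round, the `toy_score` stand-in for training, and the candidate learning rates are all assumptions made for illustration.

```python
import math
import random

def successive_halving(configs, train_and_score, min_resource=1):
    """Repeatedly evaluate all survivors, keep the better half, and give
    the remaining configurations more training resource.

    train_and_score(config, resource) must return a score (higher is better).
    """
    survivors = list(configs)
    resource = min_resource
    while len(survivors) > 1:
        ranked = sorted(
            survivors,
            key=lambda cfg: train_and_score(cfg, resource),
            reverse=True,
        )
        survivors = ranked[: max(1, len(ranked) // 2)]  # drop the worst half
        resource *= 2  # assumption: double the budget each round
    return survivors[0]

def toy_score(learning_rate, epochs):
    """Hypothetical stand-in for 'train for `epochs`, return validation score'.
    Scores peak near a learning rate of 1e-3 and improve with more epochs."""
    quality = 1.0 / (1.0 + abs(math.log10(learning_rate) + 3.0))
    return quality * (1.0 - 0.5 ** epochs)

rng = random.Random(0)
candidates = [10 ** rng.uniform(-5, -1) for _ in range(8)]
best = successive_halving(candidates, toy_score)
print(best)
```

With eight starting configurations this runs three rounds (8, then 4, then 2 survivors) at 1, 2, and 4 epochs. A real implementation would resume training from a checkpoint rather than retrain from scratch each round.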
I really love the approach of Hyperband. It’s so simple and it’s implemented in a lot of popular libraries like TensorFlow. I really like this idea of don’t waste all of your compute. If you know that that model is not going to perform well, just cut your losses early and instead double down on the models that are performing well. It’s just such a simple and elegant idea, and I’m curious, how did you come up with this idea in the first place?
Yeah. Throughout the course of my work, there’s always been this theme of simple is better, and simple actually performs surprisingly well, right? So even before I started working on the problem, there was a paper by Bergstra and Bengio on random search for hyperparameter tuning, which basically showed that random search is very competitive, and there are situations in which random search performs almost as well as Bayesian optimization. It’s also much better than grid search. So that paper I think really was the inception of this idea that random search by itself is already really good. So then our question was, how can we improve random search? And early stopping kind of seemed to be a really natural thing to try, because it is already what we do as grad students, right? Manually tuning hyperparameters. A lot of it is like early stopping and just domain knowledge, trying a bunch of different things and eventually kind of coming to a global or a local optimum in the search space that we’re considering.
So, early stopping seemed like a really good technique to combine with random search, and then the question was how do we automate this early stopping process so that researchers and practitioners can focus on the more interesting aspects of model development instead of just trial and error, manually tuning hyperparameters for these models, and it works really well. I think when we get to the next section where we talk about neural architecture search, I think random search will show up again. It’s just one of those things where the simplest method oftentimes is just a very strong baseline.
So we’ve got an excellent segue coming from the context of random search, what exactly is neural architecture search in that case?
Yeah. Neural architecture search is exactly what the name suggests. So with all the focus on deep learning these days, I think a lot of people are curious or are wondering how do all these architectures come to be, and is there kind of modifications that we can make to popular architectures to further improve the performance, right? If you look at InceptionV3 or ResNet, or even before that AlexNet, right? So all these architectures seem somewhat arbitrary. Like why do we put convolutions in specific layers? Why are there a specific number of filters and channels and so on? I think that’s a natural question to ask is how do we design architectures and how can we kind of maybe perform some local search around known architectures to further boost performance. So right now, at least that’s the type of questions that neural architecture search is answering.
So the search spaces are, in some sense, what differentiates neural architecture search from just your typical hyperparameter tuning problem, but I think at the end of the day, neural architecture search is just another hyperparameter tuning problem where we’re considering a very specialized search space where the hyperparameters all control what the architecture looks like, right? So we’re not necessarily thinking about how to optimize the learning rate of the optimizer or regularize the model with weight decay. We’re thinking more along the lines of, how do we tune the hyperparameters that control the actual architecture, right? So if you take a very simple search space for convolutional architectures, you could be looking at how many convolutional layers, how many channels in each layer, what sort of pooling function should I use? So, that’s kind of what a simplified neural architecture search space can look like.
The search spaces that are being used in the latest research are much more complicated and have a lot more hyperparameters. So you are looking in effect at quadrillions of possible architectures in a particular search space that you’re interested in, right? So again, just to summarize what I said there because it was a lot, it’s just another hyperparameter tuning problem, except now the search space you’re considering is specialized for different architectures.
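To make the simplified convolutional search space above concrete, here is a small Python sketch that treats architecture choices as hyperparameters and samples them at random. The specific depths, channel widths, and pooling options are assumptions chosen for illustration; they do not come from any published search space.

```python
import random

# Hypothetical architecture search space (illustrative choices only).
DEPTH_CHOICES = [2, 4, 6, 8]          # number of convolutional layers
CHANNEL_CHOICES = [16, 32, 64, 128]   # channels, chosen per layer
POOLING_CHOICES = ["max", "avg"]      # pooling function

def count_architectures():
    """Count every distinct architecture the space can express."""
    per_depth = (len(CHANNEL_CHOICES) ** depth for depth in DEPTH_CHOICES)
    return sum(per_depth) * len(POOLING_CHOICES)

def sample_architecture(rng):
    """Random search over architectures: draw each choice independently."""
    depth = rng.choice(DEPTH_CHOICES)
    return {
        "channels_per_layer": [rng.choice(CHANNEL_CHOICES) for _ in range(depth)],
        "pooling": rng.choice(POOLING_CHOICES),
    }

rng = random.Random(0)
print(count_architectures())  # 139808
print(sample_architecture(rng))
```

Even this toy space holds about 140,000 architectures, because the per-layer choices multiply; adding more decisions per layer is how research search spaces grow so large.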
How feasible is it for everyday data scientists to leverage neural architecture search? There is a paper I read a while back that they’re claiming state-of-the-art results, but they’re training on 800 GPUs concurrently, and I know I get yelled at if I use more than like four GPUs concurrently, how common is it for everyday data scientists to use neural architecture search?
Yeah, I think that’s a really good question, and that question has motivated a lot of more recent work in neural architecture search. If you look at the first generation of neural architecture search approaches, these methods were all coming out of Google, and they would spend thousands of GPU days to search for these architectures. And in the beginning, it was very expensive to do neural architecture search because all of the algorithms that people used actually required training individual architectures for some number of epochs and evaluating them before updating a policy that would be used to select a new architecture to evaluate. Since then, a lot more efficient methods have been developed. So now we’re talking on the order of, say, one GPU day to do architecture search, right? There’s definitely work in making neural architecture search more applicable and more useful for the general practitioner.
So be on the lookout for these more efficient approaches. I would say, though, that the companies that are doing a lot of the industrial-scale research into NAS have been fairly good about open sourcing the architectures that they’ve discovered. So you can go and take one of these state-of-the-art architectures published by Google and apply it to your own dataset, right? So they will release the architecture along with pre-trained weights. It is really easy to benefit from the computation that other companies have already poured into performing architecture search.
Yeah, so that’s all to say that I think there is definitely benefit in using some of the results from neural architecture search research in your own sort of modeling process, right? So whether it’s using some pre-trained weights or using some of these more efficient methods and applying them to your own data sets, I think it could be worth investigating some of these possibilities, for computer vision in particular, and more recently, I think, in NLP and transformers, but there’s definitely less work I think in those areas.
Yeah. That definitely makes sense to leverage the architectures that these companies with tons of compute are generating. I’ve seen some of them and they have skip connections going from like layer five to 17 and to 21. Why they pick those layers? I have no idea, but it performs better. So that’s definitely a really helpful piece of advice, to just leverage the architectures that they’re discovering, even if you yourself are not actually doing any neural architecture search.
I wanted to follow up on that. Since you’re doing research in this area, do you have any general guidance for architecture designs or hyper architectural patterns that you tend to search over different combinations of them?
One application area where neural architecture search really shines is in customizing architectures for specific deployment settings, right? Given the hardware constraints on different phones or different edge devices, you won’t be able to fit the same model on every single device, so you want to find kind of smaller and more efficient architectures for specific deployment scenarios, right? That’s I think where neural architecture search is able to offer a lot of benefit relative to just using say a fixed architecture, but scaling the channels or scaling the width down to fit on specific devices.
That’s kind of the simplest approach: same architecture, and just scale things relative to the deployment constraint. With NAS you’re able to do something that’s more fine-grained, and this once-for-all approach trains a single what’s called super network that includes all possible architectures in your search space, and then you can use that super network to find architectures to fill out an entire Pareto front of trade-offs between accuracy and, say, latency, or accuracy and memory usage. You can just do that by sampling architectures, passing them into the super network that’s already pre-trained, and using the super network to get a signal for performance under certain constraints. Once you create this Pareto front, where you kind of try to maximize the accuracy for every single slice of, say, inference speed that you care about, so that you have the most accurate architecture across a range of different inference speeds, you can just take those best architectures and push them to the devices that you care about, right?
That’s I think one application of NAS that’s really exciting and I think potentially really useful for people, this once-for-all approach. Again, you just do the super network training once, and you’re able to use this one network to search architectures for a bunch of different deployment settings, right? So the cost of the architecture search is amortized over all the possible deployment settings that you’d want to consider. So I think that’s a really exciting way for people to start using NAS.
Wow, this is extremely interesting, to say the least. I’m going to switch gears, just because we’re taking a lot of your time and we really appreciate it. On a completely different note here, what’s the piece of advice that you would have wanted to give yourself as you started this journey into machine learning?
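The Pareto front mentioned above can be computed with a simple dominance check. In this minimal Python sketch, the candidate names, accuracies, and latencies are made-up placeholder numbers; in the approach described, those accuracy estimates would come from the pre-trained super network rather than being hard-coded.

```python
def pareto_front(candidates):
    """Keep architectures that no other candidate beats on both axes:
    higher accuracy is better, lower latency is better."""
    front = []
    for a in candidates:
        dominated = any(
            b["accuracy"] >= a["accuracy"]
            and b["latency_ms"] <= a["latency_ms"]
            and (b["accuracy"] > a["accuracy"] or b["latency_ms"] < a["latency_ms"])
            for b in candidates
        )
        if not dominated:
            front.append(a)
    return sorted(front, key=lambda c: c["latency_ms"])

# Hypothetical sampled architectures with placeholder measurements.
candidates = [
    {"name": "A", "accuracy": 0.70, "latency_ms": 10},
    {"name": "B", "accuracy": 0.75, "latency_ms": 20},
    {"name": "C", "accuracy": 0.72, "latency_ms": 25},  # slower and worse than B
    {"name": "D", "accuracy": 0.80, "latency_ms": 40},
    {"name": "E", "accuracy": 0.78, "latency_ms": 45},  # slower and worse than D
]

front = pareto_front(candidates)
print([c["name"] for c in front])  # ['A', 'B', 'D']
```

For a device with a given latency budget, you would pick the most accurate architecture on the front that fits within that budget.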
Yeah. I think there are two points where I could have given some really valuable advice to myself. One point is during my PhD, I think it’s natural to have doubts during a PhD about whether you’re going to make it, whether you can publish enough to graduate, and I think I would have just told myself to focus on the journey and not the destination to quote from Brandon Sanderson if people read that series, but yeah, the journey I think is much more enjoyable if you’re not constantly thinking about, Oh, am I going to graduate? Am I going to be able to publish? I think regardless of where the low points might’ve been for me, the PhD was a really great experience and I learned a lot throughout the journey, so I think it’s really important to focus on that aspect.
The other thing I would tell myself, and this is before I even started a PhD, I think I would have told myself to just go and do it, right? I think I was somewhat intimidated by machine learning before I got into it, and I think I would have been less intimidated if I just got my hands dirty and tried a bunch of different things. I think that’s even more true now than it was five years ago with a lot of the open source tools and libraries that are available. It’s very easy to try machine learning and build machine learning solutions for a bunch of different applications, computer vision, object detection, NLP, whatever. That’s all super easy now, and if you’re interested, I think you can get really far with just a little bit of programming experience and basic knowledge of machine learning. So yeah, I would just tell myself to go and do it and not make excuses for why I couldn’t have gotten started five, six, seven years ago. So yeah.
That’s some super helpful advice. I really hope everybody takes that message of just go out and do it, try out different hyperparameter tuning techniques, try out neural architecture search, just go out there, learn and do it. I wish I had that mantra when I was an undergrad wanting to get into machine learning, but just not knowing where to start. So thank you again so much for your time, Liam. I always learn so much from every conversation that we have, and thank you again for joining us and sharing your expertise on hyperparameter tuning and neural architecture search.
Yeah, definitely. Thank you guys so much for having me.