How to Use Big Data Systems to Democratize LED and other Census Data


Coordinator: Welcome and thank you for standing
by. All participants will be able to listen only
until the question-and-answer portion of today’s conference. To ask a question please press star 1. Today’s conference is being recorded. If you have any objections please disconnect
at this time. I would now like to turn the conference over
to Ms. (Earlene Dowell). Miss, you may begin. (Earlene Dowell): Thank you (Julie). And thank you to Lisa Glover from the U.S.
Census Bureau for hosting our webinar today. On behalf of the U.S. Census Bureau and our
partnership with the Council for Community and Economic Research and the Labor Market
Information Institute, welcome to our first LED webinar of 2020. It is with great pleasure that I introduce
one of our esteemed presenters from the 2019 LED Partnership Workshop, Vivian Zheng, as she
presents how to use big data systems to democratize LED and other Census data. The Data Science Team at Urban has developed
a big data system that allows researchers to easily access and analyze LED data. In addition the team has used the system
to create national summary files of the LED data at the tract and place level. In this presentation Zheng will discuss the
systems built to read and process the data, why it’s valuable to the researchers, the
use cases seen so far from our researchers and partners and how researchers can access
the summarized data. Vivian Zheng is a Data Science Analyst in
the Office of Technology and Data Science at the Urban Institute where she works with
researchers to improve access to data analytic tools and innovative research methods. Zheng is a programming expert with experience
in natural language processing, machine learning, big data processing, web scraping and open
data portal management. She holds a Bachelors of Management in Labor
and Social Security from Zehang University and earned an MPP from Georgetown University. With that I hand it over to Vivian. Vivian Zheng: Hi everyone. So first of all thanks for having me here
and my name is Vivian Zheng. I’m a Data Scientist from the Urban Institute. So today I would like to share how Urban uses
the big data system to democratize access to the LED data. So first a brief introduction about
the Urban Institute and our data science team. So the Urban Institute is a not-for-profit
policy research organization in Washington D.C. We have 12 policy centers with
over 500 people and we conduct a variety of different policy research. Some big areas are metropolitan housing
and communities policy, income and benefits policy, health policy and education data policy. So our data science team works very
closely with the researchers across Urban to try to apply advanced data science technologies
and the methods to help the researchers to make their work much more effective with data. So, for
example, we use natural language processing methods and technology to analyze
land use reform news articles. And we also use machine learning methods
to categorize the Twitter data information based on the research topic. And recently we also introduced a lot of the
cloud computing technologies and systems to enable the researchers to process a huge
amount of data within a short amount of time. So one of the primary goals for our data science
team is to provide the researchers with easy and trusted access to data. So as we all know the LODES data under the
LEHD program is a really great and valuable resource for the researchers and the local
governments. So a brief introduction about the LODES data. It contains very detailed spatial distribution
of workers’ employment and residential locations and the relations between the two. And all the information is at the Census
block level and it’s state based and it’s organized into three different types
including the residential area characteristics, workplace area characteristics and origin
destinations. And the Census Bureau puts great effort into
putting this data set together and making it public and our work at Urban is trying to
make the LODES data even more valuable and accessible to the researchers at Urban and
also whoever is interested in this data set. And the raw form of the LODES data is
in a set of over 75 cells in the CSV files. So normally the researcher will need to go
to the LODES website, manually download all the files, unzip them and process them from
there. But it can take a lot of time and also
computer memory and storage space to do that. So one of our goals on Urban’s data science
team is to help simplify the data downloading and processing for the researchers. And apart from the block level LODES data
the tract and place level LODES data is also very popular among the researchers at Urban
because a lot of the policy research is conducted at the tract and the place level. So the second goal for us is to summarize
the block level LODES data to the place and tract level and also make it public at Urban’s data
catalogue. So Urban’s data catalogue is the central place
where Urban publishes its data. And it’s an open data portal launched by Urban
last year and everyone basically can go there and download the files that we published. So next I would like to go through the steps
we use to process the LODES data and the underlying technology that we use in this pipeline. So first as I just mentioned we automate downloading
all the LODES files. And secondly we would like to convert
all the CSV files into the Parquet format. I will say more in detail about what the
Parquet format is. And then we will summarize all the block level
data to the place and tract level and make it available at Urban’s data catalogue. But internally at Urban the researchers can
also analyze the block level LODES data using the Cloud Clusters through Urban’s big data
system. I will also mention that in detail later. So first, automating the download of the LODES files. So if we go to the Census Bureau LODES website
we can see this really user-friendly interface. And then under this download LODES data we
can select the state we are interested in and the types of jobs we are interested in,
and we can see a list of files pop up here. And from there we can manually download all
the files that we are interested in. But the Census Bureau website also provides
a really useful FTP site, which has a specific type of directory structure
that allows us to use a Python script to automatically download all LODES files. So using this Python script we can download
all the LODES files across all years and all states within several minutes. We also make the Python script public through
our GitHub. So if you’re interested feel free to go to
the GitHub link that is posted on the left side of the screen. And we also provide instructions and recommendations
on how and where to run this Python script. And secondly as I mentioned we want to
convert the CSV files to the Parquet format. So for those who are not familiar with the
Parquet format, it’s essentially a compressed file format. It’s kind of similar to the Zip files that
we usually use. It can help us save a lot of storage space. And the Parquet format is also a columnar
storage format. It’s a special storage format which makes
reading and writing the data much faster. And now we have the Parquet version of the
data. We want to summarize this block level data
to the place and tract level. So specifically first we want to merge the
LODES data with the Census geographic crosswalks using the Census block number. And then we will sum up the total number of
workers in each block by their Census tract and place number. So those who are familiar with data
manipulation know that those two steps, especially the data merging and the data aggregation,
can be really computationally heavy and our local computer may run
out of storage space and memory if we want to process them locally. So really the technology that we use here
that enables us to do this kind of big data processing is called cloud computing. So cloud computing is different from the
local computer that we usually use every day. It’s provided by cloud computing services. Some big ones are Amazon Web Services
or Microsoft Azure and the specific type of cloud computing technique we use here
is called a Cloud Cluster. So a Cloud Cluster is different from the one
single computer that we usually use. It’s a group of computers and in a group of
computers there is a master node and then the master node coordinates with different
worker nodes and they work together to make the processing much faster. And one of the other advantages of the Cloud
Cluster is that it’s on demand, which means we don’t need to buy the physical computers ourselves. Whenever we want to use the Cloud Cluster
we just need to rent it online from the cloud computing providers. And once we are done using that we just need
to terminate it and we are only charged by the time that we use it. And then secondly the Cloud Cluster will reduce
the processing time significantly. For example, we had a project with our health
policy center at Urban and it took the researchers
500 hours to process the data. And after moving everything to the cloud using
the Cloud Cluster it only took us 10 minutes to finish the whole thing. And the Cloud Cluster is also very cost effective. It only costs us $2 an hour on average and
the total cost really depends on the amount of time that we use the Cloud Cluster and
the number of worker nodes and the size of the cluster that we choose. And the Cloud Cluster is also very scalable. We can choose as many worker nodes as we want
and we are also able to spin up more than one cluster at the same time. And among all the programming languages that
our researchers are familiar with so far the Cloud Cluster is compatible with the programming
languages Python and R. So as I just mentioned, after we summarize
the block level LODES data to the place and tract level we put it on Urban’s data
catalogue. So if you go to datacatalogue.urban.org and
search for LODES you can find that two items will show up and if you click either one of them
you can find multiple CSV files as you can see from the screenshot on your right. So far we have processed the residential
area characteristic and the workplace area characteristic data files. And we also separate all jobs and
federal jobs. And right now the CSV files on Urban’s
data catalogue are at the tract and place level which means they are much smaller than
the files at the block level. So from here the researchers can easily download
those small CSV files, store them on their local computer and then read them into whatever
Excel or Stata programs that they are familiar with and process the data from
there. So as I just mentioned we have this
Cloud Cluster but we want not only people on our data research team to be able to use
it but also the researchers at Urban to easily use it. So we set up Urban’s Big Data System
which enables the researchers to easily program using the language they are familiar with
while also taking advantage of the Cloud Cluster. So we set up the system in a way that the
researchers only need to go to an internal website and put down their email and then
choose the number of workers they need and then submit the form. And within five to ten minutes they will get
an internal email with a link that they can click and once they put in their username and
password an RStudio Server web interface will show up. So from here the researchers can easily process
the data using the R programming language but also within our Cloud Cluster environment. So here is a simple example of how people
can take advantage of this environment. So here I’m reading a small subset of the
LODES data and we can do some easy summaries starting from here. So for example, we can list all the columns
in here and then we can count the number of rows in this data set. So we can see that even though this is a really,
really small subset of the LODES data it contains over 31 million rows which is super large
and it is almost impossible to open such a big file on our local computer. And then we can even take a look at a subsample
of this data set and get a sense of what each column looks like. And then from here the researchers can do
whatever custom analysis they want. For example, one of my co-workers summarized
the LODES data to the zip code level that he is interested in. And another co-worker just subset
the data to a specific area he is interested in. And from here they can easily export the data
and write it as a CSV to their local computers. And then in terms of future work we’re definitely
going to keep updating the LODES data as new years or new types of LODES data come
out. For example, last year when the new 2017 LODES
data came out we were really excited and immediately updated the summary files on Urban’s data catalogue. And we are also thinking about including more
job types for the residential and workplace area characteristics. And we are also open to including the LODES
origin destination data in Urban’s data catalogue. So for Urban’s data catalogue in general
we’re definitely very open to integrating more big data sets. Again if you’re interested feel free to go to
datacatalogue.urban.org to find and download for free the data that you are
interested in. All right. That’s it. Thank you so much for listening. If you have any questions feel free to ask
me. (Earlene Dowell): Vivian, this is (Earlene). There was a question that came in on the chat. And it says, would it be possible to have
a realistic example of how to use this process as a demonstration for us? Vivian Zheng: So I guess the question is about
how to use the Cloud Cluster – an example where you use a Cloud Cluster. So here if I go back to my presentation, so
right here I have a really good example about how we can read the data and then process
it. I didn’t put much detail in it. But if you’re interested feel free to
reach out to me and I can show you a very detailed example of how we can process
the data through our big data system. Coordinator: If you would like to ask a question
over the phone please press star 1 and you will be prompted to record your first and
last name. Please unmute your phone when recording your
name. And to withdraw your question press star 2. One moment please. (Earlene Dowell): I have another question
Vivian that came in on the chat. Is the Urban Data Catalogue free to use and
do I need to register or log in to use it? Vivian Zheng: Yes. So Urban’s Data Catalogue is an open data
portal and it’s totally free and everyone can go there and download the data files if
they want. Coordinator: We do have some questions on
the phone. Coordinator: I believe we have (Ralph) from
Houston Texas. I apologize if that’s incorrect. (Ralph): That is correct. Coordinator: Your line is open. (Ralph): Yes. My question I just wanted to know what is
the link – where do we go to actually download the – a copy of this recording? Vivian Zheng: Sorry. So your question is where people can go to
download the summary LODES files, is that the question? (Ralph): Yes. Vivian Zheng: Okay. So you can go to this link here. I don’t know if you can still see the screen
but it’s datacatalogue.urban.org. And if you go there you don’t need to log
in or register or anything. You just need to search LODES and you can
find the LODES summary files there. It’s datacatalogue.urban.org. Yes. (Ralph): Okay. Awesome. Thank you so much. Vivian Zheng: Yes. Sure. Coordinator: Our next question comes from
(Kurt). Your line is open. (Kurt): Yes. You’ve got information on where people live
and where people work. I mean could you like try to imagine everybody wants to car
pool and see how well that might work as far as reducing traffic and how many would use
it. Vivian Zheng: Yes. For sure. I think for the residential – especially for
the origin destination data from the LODES data it provides the information on where
people work and live. And from there I think there is great potential
in analyzing how carpooling would work. Actually I saw there is research
on this carpool topic using the LODES data on the official LODES website and you can check
it out there. But (unintelligible). (Kurt): I mean it seems if you could imagine
how well this could possibly work. Vivian Zheng: At least I think we can basically
analyze the distance between where people live and where people work and combine with
some other data sets about like how the traffic is during their commute and from there we
can do some spatial analysis on how the carpool would work. If you’re interested I can connect you with
some other metropolitan policy experts at Urban and we can get into more detail
about this topic. (Kurt): Okay. Thank you. Vivian Zheng: Thank you. Coordinator: Our next question comes from
(Ellen). Your line is open. (Ellen): Oh thank you. Yes. I’m just seeing this for the first time and
I’m trying to become a 2020 Census member and on my screen if I want to join the chat
is that enabled or is it not? Thank you. Vivian Zheng: Sorry. I can’t really hear the question very clearly. Would you mind repeating that? (Ellen): I’m sorry. Yes. I’m trying to join this chat and I’m interested
in becoming a 2020 Census Taker. Can you hear me sufficiently? Vivian Zheng: Yes. Yes. Okay. (Ellen): I’m in the – I’m in the – yes, I’m
in the webinar and how do I access the chatter or chat presentation? Thank you. (Earlene Dowell): So ma’am, if you just refer
to the USAJobs.com that will help you for – this is entirely different or separate from… (Ellen): Yes. I was just introducing myself. I do have some experience in this area and
I would be very excited to join the webinar so I will be (unintelligible). Thank you. (Earlene Dowell): All right. Thank you for your question. (Ellen): You’re very welcome. Coordinator: Our next question comes from
(Ronald). Your line is open. (Ronald): Yes, ma’am. I wanted to know but my question has been
answered. Someone asked a question about getting the
data and so I’m going to go to datacatalogue.urban.org and go under LODES. So that answered my question. Thank you. Coordinator: Our next question comes from
I believe it’s (Gregen). Your line is open. (Gregen): Thank you. The files in CSV the question is regarding
them. Is it required to change the format to Parquet
or can they be downloaded in Excel format? Vivian Zheng: Yes. The Parquet format is optional. You
can totally download the CSV file directly and open the CSV file in
Excel, Stata, Python, R or whatever program you’re familiar with. The reason that we convert to Parquet is that
we are going to analyze that using the Cloud Cluster so it’s a preferred format for the
Cloud Cluster system. (Gregen): Thank you. Coordinator: Once again please press star
1 to ask a question. One moment. (Earlene Dowell): So Vivian I have a couple
of questions that have come in on the chat. Vivian Zheng: Okay. (Earlene Dowell): One of the questions is
regarding the timeline to add the OD data. Vivian Zheng: So if people
are interested in the origin destination summary files we can totally add them to Urban’s
data catalogue and we can do it I think within a couple of months. Yes. (Earlene Dowell): Another question is, how
current is the data? Vivian Zheng: I think the data is the most up to date
right now. The latest 2017 data came out last August
I think and we updated it last August. Coordinator: We do have a question online
as well. (Roy), your line is open. (Roy): Thank you. I was impressed with the presentation but
I had one question. How authentic is going to be the data before
we put it into the Cloud Cluster anyway? Vivian Zheng: Sorry. Can you repeat the question? (Roy): How authentic is going to be the data
that is going to be collected? Vivian Zheng: You mean how big is the data
before we put it into Cloud Cluster? Is that the question? (Roy): No. How authentic is going to be the data and
if it’s not authentic how are you going to make it authentic to put it as authentic? Vivian Zheng: Sorry. I think you’re kind of breaking up and I can’t really
hear it but feel free to shoot me an email about the question. Thank you. Sorry about that. (Roy): Okay. Coordinator: I’m showing no further questions
at this time. (Earlene Dowell): All right. I have a couple of questions. One is, what are the ways people have used
these tools and data from the portal? Vivian Zheng: I think – so at Urban the researchers
have been using the tract level LODES summary files to analyze neighborhood
economies. And also using the LODES data to analyze
carpool and transportation issues. If you’re interested you can go to the Urban Institute
website to see our research on that. Thank you. (Earlene Dowell): Another question is can
this be used in conjunction with CTPP data? Vivian Zheng: Sorry, I’m not that familiar
with CTPP data. But I think if that data contains the geographic
information, for example the Census block numbers, Census tract and the place ID, we can easily
merge that data set with the LODES data sets. (Earlene Dowell): There was another question
which I think you’ve already answered about how much info can I find out about my neighbor
– I think it means neighborhood – from this data? And, you know, I saw that you said you can
analyze the neighborhood’s economy and we can look at transportation and commuting patterns,
right? Vivian Zheng: Yes. And I think the data also contains information
about the employees’ age, race and other demographic information. So I think we can dig in to that part as well. (Earlene Dowell): Great. And then how large, in gigabytes, is the data
that we can download from datacatalogue.urban.org? Vivian Zheng: I think in terms of the summary
files, for example in the tract level summary files there are 8
CSV files and each CSV file is only several hundred megabytes. So it’s totally feasible to download all the
files to a local computer. Yes. (Earlene Dowell): Another question is any
particular resources slash recommendations for tutorials for how to get started with
cloud computing with R? Vivian Zheng: Yes. I think there is a cloud computing teaching
website, a tutorial website called A Cloud Guru, and you can go there and take the
lessons to learn how to get up to speed with cloud computing services. Yes. (Earlene Dowell): Another question is, will
the data in the catalogue be the most up to date? Are the files presently available through
2015 or 2017? Vivian Zheng: Yes. The summary files on the data catalogue
are the most up to date right now and they run from 2002 to 2017. (Earlene Dowell): Here is another: CTPP is
similar but based on a survey sample and is at the TAZ level. LODES data is based on payroll data. I think that’s a statement. I’m not sure. Vivian Zheng: Yes. Again I’m not that familiar with the CTPP
data. I can do more research on that but feel free
to shoot me an email and we can talk about that in detail. (Earlene Dowell): Vivian, also in the chat
they are asking about your contact information. Vivian Zheng: Okay. So my email is [email protected]. So feel free to shoot me an email and we can
talk about everything in more detail. (Earlene Dowell): Operator, are there any
more questions on the lines? Coordinator: We did have some questions come
in. (Roy), your line is open. (Roy): Thank you. And I apologize earlier for the miscommunication
due to breaking up. My question is how authentic is going to be
the data collected? Vivian Zheng: So I think you’re asking about
the data collection process. (Roy): Yes. The authenticity of the data. Vivian Zheng: Authenticity? So what we do is we directly download the
LODES files, the original raw format CSV files from the LODES official Website and we do
not change anything about the raw level LODES data. So I think it’s pretty authentic. And then we keep everything as it is before
we do all the analysis. (Roy): Thank you. I appreciate that. Vivian Zheng: Yes. Thank you. (Roy): Welcome. Coordinator: Our next question comes from
(Medisha). Your line is open. (Medisha): Yes. Hi, I did not get the email. So if you could please type it so, you know,
if you need to call or email. And also I work for a city what data is available
at the city level and how do you access it? I mean, is it available at block level or
block group levels because the tracts are bigger than the cities? Vivian Zheng: Yes. I think one thing you can do is you can
directly go to the LODES website and download the state based LODES files and then do some
processing from there to extract the city level information by yourself. I think at this point we can only provide
the LODES summary files at the Census tract and the Census place level. But I think for the next step we’re open to
providing the city level LODES data summary files if it will help with your research and
analysis. (Medisha): Absolutely. So what do you mean by place? Vivian Zheng: I think the – so in the Census
Bureau’s geographic crosswalk there is a Census place ID and we basically use that
to aggregate our work data. (Medisha): Okay. In some cases it’s similar to the cities so
I guess I need to go see if we do have that level of data. Okay. Thank you. Could you please type in your email because
it wasn’t clear to me? Vivian Zheng: Yes. Sure. Okay. Sure. (Medisha): Okay. Thank you. Coordinator: I’m showing no further questions. (Earlene Dowell): Vivian could you also type
in the address for where they can learn how to use the cloud data again? Vivian Zheng: Okay. Sounds good. (Earlene Dowell): And just to answer the question
that was just previously asked about the LODES data. The LODES data goes all the way down to the
block level. Vivian Zheng: Yes. That’s right. (Earlene Dowell): All right. If there are no further questions I would
like to take this time to just say thank you to Vivian for her very informative and sophisticated
presentation. Also thank you to the audience for joining
us this afternoon. Join us next month, Wednesday, March 18 at
1:30 pm Eastern Standard Time when Andrew Foote presents Using National Jobs Data to
Measure Graduate Impacts. Until then enjoy the rest of your day.
