mission:log:2014:11:24:using-python-lxml-request-as-simple-scrape-robot-for-metrics-from-webpages (chrono, 2014/11/25 00:24; current revision 2016/08/09 19:13, Updated VFCC links)
===== In the beginning there was the copy =====
  
Even if it appears unique and original to us, there was always some other inspiration or model to copy from. Most of what we do is based on ideas and concepts laid out by other people before us, and their ideas evolved in the same manner. It's basically all about perception. I could present you the final python robot and say: "This is my awesome original work". And you might believe it, since it's slick, streamlined and very efficient. But that is just the current result. You wouldn't (and in most cases won't) see how crappy it began and how it evolved into its current form. But this is exactly what we're going to do today.
  
===== The Problem =====
In order to start collecting long-term data to prove the [[lab:ucsspm]], reliable reference data was needed. A good industry-grade pyranometer is too expensive at this time, and hacking a cheap one just reintroduces the problem of reference data for calibration. So I searched the net for data sources and found [[http://www.meteo.physik.uni-muenchen.de/dokuwiki/doku.php?id=wetter:stadt:messung|this site]] of the LMU.
  
Unfortunately the data isn't accessible through an API or at least some JSON export of the raw data, which meant I needed to devise a robot that would periodically scrape the data from that web page, extract all needed values and feed them into the UCSSPM to calculate with real data for reference. Once it has done all that, it has to push all usable raw data and the results of the UCSSPM prediction into an influxdb shard running on the stargazer, so that the data can be stored, queried and (re)viewed live on the following VFCC dashboards:
  
  * [[https://apollo.open-resource.org/flight-control/vfcc/dashboard/db/aquarius-external-environment|External Environment Data]]
  * [[https://apollo.open-resource.org/flight-control/vfcc/dashboard/db/aquarius-solar-power|Aquarius Solar Power]]
  * [[https://apollo.open-resource.org/flight-control/vfcc/dashboard/db/odyssey-solar-power|Odyssey Solar Power]]
===== The bash solution =====
  
Infrequently the upstream data changed, introducing some incomprehensible whitespace changes as a consequence, and sometimes it just delivered 999.9 values. A pain to maintain. And since most relevant values came as floats, there was no other solution than to use bc for floating-point math and comparisons, since bash can't do it.
  
And finally, the data structure and shipping method to influxdb is more than questionable; it would never scale. Each metric produces another new HTTP request, creating a lot of wasteful overhead. But at the time of writing I simply didn't know enough to make it better.
  
===== The python solution =====
  
Seeing the bash script fail regularly and having to look after it all the time was not an option. This robot needed more features and capabilities: first of all, trying to stay alive no matter what kind of beating it gets, and a way more sophisticated approach to handling and evaluating the data it's responsible for. That went beyond the limit of what can sensibly be done in bash.
  
When I reach that conclusion, I usually turn to python, so I started by looking at countless scraping examples in python. I installed and uninstalled a lot of pip packages like beautifulsoup4, scrapy and the countless other tools you can find when searching for python web scraping. But I couldn't get anything to work with them when I copied the code and tried to adapt it to my own use case. So I decided to step back, reconsider what I'd learned (by copying and failing to adapt it), break it down into single tasks and go step by step from scratch.
  
**1. Reduce the amount of data to transfer and parse**

After searching Dokuwiki's docs I discovered a nice feature for that: **doku.php?do=export_xhtmlbody** delivers the page content only. This alone reduces the amount of traffic by at least 30% and also the risk of changes which might break the scraper again in the future.
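As a rough sketch, the export trick looks like this (the page id is the one from the LMU site linked above; the snippet only builds the export URL, the actual fetch is a one-liner on top of it):

```python
from urllib.parse import urlencode

# do=export_xhtmlbody makes DokuWiki render only the page body, without
# navigation, header and footer -- less to transfer, less to parse, and
# fewer layout changes that could break the scraper later on.
BASE = "http://www.meteo.physik.uni-muenchen.de/dokuwiki/doku.php"
query = urlencode({"id": "wetter:stadt:messung", "do": "export_xhtmlbody"})
url = f"{BASE}?{query}"
print(url)

# The fetch itself (needs network access):
# html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8")
```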
    
**2. Try to find a structured way to look into specific HTML elements only**

After looking at lxml examples again, it seemed feasible to extract just the TD elements, and in this case all data was wrapped inside TD elements, so after a bit of testing this worked pretty well.
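The TD extraction boils down to a single XPath query; here is a minimal sketch against a made-up stand-in for the scraped table (labels and values are invented, the real page has many more cells):

```python
from lxml import html

# Miniature stand-in for the scraped page: on the real site every
# measurement value sits inside a <td> cell of a table.
SAMPLE = """
<table>
  <tr><td>Temperatur</td><td>7.3</td></tr>
  <tr><td>Windgeschwindigkeit</td><td>2.1</td></tr>
</table>
"""

tree = html.fromstring(SAMPLE)
# //td selects every table cell, no matter how deeply it is nested.
cells = [td.text_content().strip() for td in tree.xpath("//td")]
print(cells)
```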

**3. Increase resilience: have a reliable regular expression to extract all numbers (signed/unsigned int and float) and have EVERY input sanity-checked and cast into its designated type**

Well, stackexchange is full of examples of regular expressions to copy, and http://www.regexr.com/ offers a nice live test for them. I combined all this into flextract.
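The text names the helper flextract, but its implementation isn't shown here, so the following is only my reconstruction of what such a sanity-checking extractor could look like:

```python
import re

# Optional sign, digits, optional decimal part: matches signed and
# unsigned integers as well as floats (e.g. -3.4, +7, 999.9).
NUMBER = re.compile(r"[-+]?\d+(?:\.\d+)?")

def flextract(raw, cast=float, fallback=None):
    """Extract the first number from a scraped cell, cast it into its
    designated type, and fall back instead of raising on garbage."""
    match = NUMBER.search(raw)
    if match is None:
        return fallback
    try:
        return cast(match.group())
    except ValueError:
        return fallback

print(flextract("7.3 °C"))             # 7.3
print(flextract("-12", cast=int))      # -12
print(flextract("n/a", fallback=0.0))  # 0.0
```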

**4. Learn more about influxdb to restructure the data to reduce the amount of timeseries**

This came almost naturally after looking at so many other examples of metric data structures; I simply copied and merged what I considered best practice.
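The restructuring can be illustrated with the 0.8-era influxdb JSON protocol that was current when this robot was written (series names and values below are made up): instead of one timeseries per metric, related metrics become columns of a single series, and several series still fit into one payload.

```python
import json

# One series per logical group, several columns per series, and the
# whole list still forms a single payload -- instead of one timeseries
# (and one request) per metric.
payload = [
    {
        "name": "environment.outdoor",
        "columns": ["temperature", "humidity", "windspeed"],
        "points": [[7.3, 81.0, 2.1]],
    },
    {
        "name": "ucsspm.prediction",
        "columns": ["output"],
        "points": [[168.4]],
    },
]

body = json.dumps(payload)
print(body)
```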
  
**5. Figure out a way to push the complete dataset in one http post request to reduce overhead**
  
I brute-forced the correct data format with another shell script feeding curl, until I was able to figure out the sequence, since there was nothing in the docs about the structure of requests with multiple timeseries. Influxdb is rather picky about strings and quotes, so it took a little while to figure out how to do it with curl and then to build and escape the structure correctly in python. I played around with append() and join() and really started to appreciate them.
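A sketch of the single-request push, assuming the 0.8-style /db/&lt;database&gt;/series write endpoint (host, database name and credentials are placeholders, not the stargazer's real configuration):

```python
import json
from urllib.request import Request

# Everything goes out in ONE http post instead of one request per metric.
payload = [
    {"name": "environment.outdoor", "columns": ["temperature"], "points": [[7.3]]},
    {"name": "ucsspm.prediction", "columns": ["output"], "points": [[168.4]]},
]

request = Request(
    "http://localhost:8086/db/metrics/series?u=scraper&p=secret",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# urllib.request.urlopen(request) would perform the actual write
# (it needs a running influxdb, so it is left commented out here).
print(request.get_method(), request.full_url)
```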
  
**6. Increase resilience: no single step's exception should kill the robot (salvation)**
  
Well, python lets you try and pass, to fail and fall back very gracefully :)
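That graceful try/fallback pattern can be sketched like this (the helper name salvage is mine, not from the original script):

```python
def salvage(step, fallback=None):
    """Run one step of the pipeline; no exception may kill the robot."""
    try:
        return step()
    except Exception as error:
        # Complain, then carry on with the fallback value.
        print(f"step failed ({error}), using fallback {fallback!r}")
        return fallback

# A failing conversion degrades to its fallback instead of raising:
temperature = salvage(lambda: float("garbage"), fallback=-999.9)
print(temperature)  # -999.9
```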
  
<sxh python; toolbar:false>
# [...]
                        return WORK
                else:
                        return fallback
        except:
# [...]
</sxh>
  
And that's that. Success. The only thing left to do, in order to close the circle again, was to share this knowledge, so that the next person looking for ways to scrape data from web pages with python can copy these examples, adapt them to their own use case, fail, learn and come up with new ideas as well. Hopefully in even less time. And it also made it pretty obvious that the [[lab:ucsspm|UCSSPM]] code has to be refactored again, so that it can be included as a python lib, in order to get rid of the system call and all the input/output piping :)

You can see the results of this robot's actions in the **[[https://apollo.open-resource.org/flight-control/vfcc/|Virtual Flight Control Center (VFCC)]]**.

And of course it goes without saying that this also serves to show pretty well how important learning computer languages will become. We cannot create an army of slaves to do our bidding (for that is what all these machines/computers/systems like smartphones, IoT devices and automatons really are) if we don't know how to command them. Our current technological state is only possible because we already hand an essential part of our workload over to machines.

But how do we expect people to be able to tell all these machines what and how exactly they're supposed to do something (training a new slave/servant) if we're not willing to speak their language? It will still take some time until we've reached a state where we have more generalized systems or the first beginnings of real (buzzword alert) artificial intelligence. Until then it's just people programming states and reactions to these states in a smart and creative fashion, but we still have to do it their way. So why still force people to involuntarily learn dead stuff like Latin or French, when the future for all of us lies in computers and programming languages?
  
{{tag>software ucsspm python data scraping metrics influxdb vfcc research}}