Data Science, Machine Learning, Natural Language Processing, Text Analysis, Recommendation Engine, R, Python
Tuesday, 31 July 2018
Geek Trivia: The Term “Cliffhanger”, To Refer To A Suspenseful But Unresolved Ending Originated With?
You Can Now Quickly Add Reaction GIFs in Gmail
The Endless Cycle: Websites Keep Getting Heavier as Internet Speeds Get Better
The Best Cordless Power Tool Systems for Every Skill Level and Budget

The cordless tool market has improved greatly over the years, but perhaps the greatest advancement has been the sheer variety.
Processing Petabytes of Data in Seconds with Databricks Delta
Introduction
Databricks Delta is a unified data management system that brings data reliability and fast analytics to cloud data lakes. In this blog post, we take a peek under the hood to examine what makes Databricks Delta capable of sifting through petabytes of data within seconds. In particular, we discuss Data Skipping and ZORDER Clustering.
These two features combined enable the Databricks Runtime to dramatically reduce the amount of data that needs to be scanned in order to answer highly selective queries against large Delta tables, which typically translates into orders-of-magnitude runtime improvements and cost savings.
You can see these features in action in a keynote speech from the 2018 Spark + AI Summit, where Apple’s Dominique Brezinski demonstrated their use case for Databricks Delta as a unified solution for data engineering and data science in the context of cyber-security monitoring and threat response.
How to Use Data Skipping and ZORDER Clustering
To take advantage of data skipping, all you need to do is use Databricks Delta. The feature is automatic and kicks in whenever your SQL queries or Dataset operations include filters of the form “column op literal”, where:
- column is an attribute of some Databricks Delta table, be it top-level or nested, whose data type is string / numeric / date / timestamp
- op is a binary comparison operator, StartsWith / LIKE 'pattern%', or IN <list_of_values>
- literal is an explicit value (or list of values) of the same data type as the column
AND / OR / NOT are also supported, as well as “literal op column” predicates.
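To make the shape of such predicates concrete, here is a small hypothetical example (the table and column names are made up for illustration, and the snippet assumes a notebook where spark and $ are in scope):

```scala
// Hypothetical Delta table and columns, used only to illustrate the
// "column op literal" filter shapes that let data skipping kick in.
val events = spark.table("events")   // a Databricks Delta table

// SQL form: simple comparisons combined with AND / OR / NOT
sql("""
  SELECT * FROM events
  WHERE date >= '2018-07-01' AND country IN ('DE', 'FR')
""")

// Equivalent Dataset form; the same skipping logic applies
events.filter($"date" >= "2018-07-01" && $"country".isin("DE", "FR"))
```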
As we’ll explain below, even though data skipping always kicks in when the above conditions are met, it may not always be very effective. But, if there are a few columns that you frequently filter by and want to make sure those lookups are fast, then you can explicitly optimize your data layout with respect to skipping effectiveness by running the following command:
OPTIMIZE <table> [WHERE <partition_filter>]
ZORDER BY (<column>[, …])
More on this later. First, let’s take a step back and put things in context.
How Data Skipping and ZORDER Clustering Work
The general use-case for these features is to improve the performance of needle-in-the-haystack kind of queries against huge data sets. The typical RDBMS solution, namely secondary indexes, is not practical in a big data context due to scalability reasons.
If you’re familiar with big data systems (be it Apache Spark, Hive, Impala, Vertica, etc.), you might already be thinking: (horizontal) partitioning.
Quick reminder: In Spark, just like Hive, partitioning [1] works by having one subdirectory for every distinct value of the partition column(s). Queries with filters on the partition column(s) can then benefit from partition pruning, i.e., avoid scanning any partition that doesn’t satisfy those filters.
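As a quick, hypothetical illustration (table and column names invented), this is roughly what a partitioned write and a pruned query look like in Spark:

```scala
// Write one subdirectory per distinct date (this is write.partitionBy(),
// not RDD partitioning), then filter on the partition column so only the
// matching subdirectories are listed and scanned.
spark.table("events")
  .write
  .partitionBy("date")
  .format("delta")
  .saveAsTable("events_by_date")

// Partition pruning: only the .../date=2018-07-31/ subdirectory is read
spark.table("events_by_date").filter($"date" === "2018-07-31")
```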
The main question is: What columns do you partition by?
And the typical answer is: The ones you’re most likely to filter by in time-sensitive queries.
But… What if there are multiple (say 4+), equally relevant columns?
The problem, in that case, is that you end up with a huge number of unique combinations of values, which means a huge number of partitions and therefore files. Having data split across many small files brings up the following main issues:
- Metadata becomes as large as the data itself, causing performance issues for various driver-side operations
- In particular, file listing is affected, becoming very slow
- Compression effectiveness is compromised, leading to wasted space and slower IO
So while data partitioning in Spark generally works great for dates or categorical columns, it is not well suited for high-cardinality columns and, in practice, it is usually limited to one or two columns at most.
Data Skipping
Apart from partition pruning, another common technique that’s used in the data warehousing world, but which Spark currently lacks, is I/O pruning based on Small Materialized Aggregates. In short, the idea is to:
- Keep track of simple statistics such as minimum and maximum values at a certain granularity that’s correlated with I/O granularity.
- Leverage those statistics at query planning time in order to avoid unnecessary I/O.
This is exactly what Databricks Delta’s data skipping feature is about. As new data is inserted into a Databricks Delta table, file-level min/max statistics are collected for all columns (including nested ones) of supported types. Then, when there’s a lookup query against the table, Databricks Delta first consults these statistics in order to determine which files can safely be skipped. But, as they say, a GIF is worth a thousand words, so here you go:

On the one hand, this is a lightweight and flexible (the granularity can be tuned) technique that is easy to implement and reason about. It’s also completely orthogonal to partitioning: it works great alongside it, but doesn’t depend on it. On the other hand, it’s a probabilistic indexing approach which, like bloom filters, may give false-positives, especially when data is not clustered. Which brings us to our next technique.
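As a minimal sketch of the idea (not Delta’s actual implementation), file-level min/max statistics can be used to rule out any file whose range cannot contain the lookup value:

```scala
// Sketch only: per-file min/max stats for a single numeric column.
case class FileStats(path: String, minVal: Int, maxVal: Int)

// Keep a file if its [min, max] range could contain the lookup value.
def filesToScan(stats: Seq[FileStats], lookup: Int): Seq[String] =
  stats.collect { case f if lookup >= f.minVal && lookup <= f.maxVal => f.path }

val stats = Seq(FileStats("f0", 0, 9), FileStats("f1", 10, 19), FileStats("f2", 0, 19))
filesToScan(stats, lookup = 4)   // Seq("f0", "f2") -- "f2" is a false positive
```

The wider and more overlapping the ranges, the more false positives survive, which is exactly the clustering problem that ZORDER addresses next.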
ZORDER Clustering
For I/O pruning to be effective, data needs to be clustered so that min-max ranges are narrow and, ideally, non-overlapping. That way, for a given point lookup, the number of min-max range hits is minimized, i.e., skipping is maximized.
Sometimes, data just happens to be naturally clustered: monotonically increasing IDs, columns that are correlated with insertion time (e.g., dates / timestamps) or the partition key (e.g., pk_brand_name – model_name). When that’s not the case, you can still enforce clustering by explicitly sorting or range-partitioning your data before insertions.
But, again, suppose your workload consists of equally frequent/relevant single-column predicates on (e.g. n = 4) different columns.
In that case, “linear” a.k.a. “lexicographic” or “major-minor” sorting by all of the n columns will strongly favor the first one that’s specified, clustering its values perfectly. However, it won’t do much, if anything at all (depending on how many duplicate values there are on that first column) for the second one, and so on. Therefore, in all likelihood, there will be no clustering on the nth column and therefore no skipping possible for lookups involving it.
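A tiny illustration of that effect, using made-up values:

```scala
// Lexicographic sort on (col1, col2): col1 ends up perfectly clustered,
// but col2 shows no global clustering at all.
val rows = Seq((3, 7), (1, 9), (3, 2), (2, 5), (1, 4), (2, 8))
rows.sorted
// => List((1,4), (1,9), (2,5), (2,8), (3,2), (3,7))
// col1: 1,1,2,2,3,3 (clustered); col2: 4,9,5,8,2,7 (effectively random)
```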
So how can we do better? More precisely, how can we achieve similar skipping effectiveness along every individual dimension?
If we think about it, what we’re looking for is a way of assigning n-dimensional data points to data files, such that points assigned to the same file are also close to each other along each of the n dimensions individually. In other words, we want to map multi-dimensional points to one-dimensional values in a way that preserves locality.
This is a well-known problem, encountered not only in the database world, but also in domains such as computer graphics and geohashing. The answer is: locality-preserving space-filling curves, the most commonly used ones being the Z-order and Hilbert curves.
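To give a feel for what a Z-order value is, here is an illustrative two-dimensional Morton key computed by bit interleaving; this is a sketch of the general idea, not the exact key Databricks Delta computes:

```scala
// Interleave the bits of x and y: x fills the even bit positions,
// y fills the odd ones. Sorting by this key keeps points that are
// close in both dimensions close together in the ordering.
def zOrder2(x: Int, y: Int, bits: Int = 16): Long = {
  var key = 0L
  for (i <- 0 until bits) {
    key |= ((x >> i) & 1L) << (2 * i)
    key |= ((y >> i) & 1L) << (2 * i + 1)
  }
  key
}

zOrder2(3, 5)   // = 39: bits of 3 (011) and 5 (101) interleaved as 100111
```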
Below is a simple illustration of how Z-ordering can be applied for improving data layout with regard to data skipping effectiveness. Legend:
- Gray dot = data point, e.g., chessboard square coordinates
- Gray box = data file; in this example, we aim for files of 4 points each
- Yellow box = data file that’s read for the given query
- Green dot = data point that passes the query’s filter and answers the query
- Red dot = data point that’s read, but doesn’t satisfy the filter; “false positive”

An Example in Cybersecurity Analysis
Okay, enough theory, let’s get back to the Spark + AI Summit keynote and see how Databricks Delta can be used for real-time cybersecurity threat response.
Say you’re using Bro, the popular open-source network traffic analyzer, which produces real-time, comprehensive network activity information [2]. The more popular your product is, the more heavily your services get used and, therefore, the more data Bro starts outputting. Writing this data at a fast enough pace to persistent storage in a more structured way for future processing is the first big data challenge you’ll face.
This is exactly what Databricks Delta was designed for in the first place, making this task easy and reliable. What you could do is use structured streaming to pipe your Bro conn data into a date-partitioned Databricks Delta table, which you’ll periodically run OPTIMIZE on so that your log records end up evenly distributed across reasonably-sized data files. But that’s not the focus of this blog post, so, for illustration purposes, let’s keep it simple and use a non-streaming, non-partitioned Databricks Delta table consisting of uniformly distributed random data.
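Purely as a hedged sketch of what that streaming ingestion could look like (the input stream, column names, and paths below are hypothetical, not from the talk):

```scala
import org.apache.spark.sql.functions.to_date

// broConnStream is assumed to be a streaming DataFrame of Bro conn records
// with a timestamp column `ts`.
broConnStream
  .withColumn("date", to_date($"ts"))
  .writeStream
  .format("delta")
  .partitionBy("date")
  .option("checkpointLocation", "/delta/bro_conn/_checkpoint")
  .start("/delta/bro_conn")
```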
Faced with a potential cyber-attack threat, the kind of ad-hoc data analysis you’ll want to run is a series of interactive “point lookups” against the logged network connection data. For example, “find all recent network activity involving this suspicious IP address.” We’ll model this workload by assuming it’s made up of basic lookup queries with single-column equality filters, using both random and sampled IPs and ports. Such simple queries are I/O-bound, i.e., their runtime depends linearly on the amount of data scanned.
These lookup queries will typically turn into full table scans that might run for hours, depending on how much data you’re storing and how far back you’re looking. Your end goal is likely to minimize the total amount of time spent on running these queries, but, for illustration purposes, let’s instead define our cost function as the total number of records scanned. This metric should be a good approximation of total runtime and has the benefit of being well defined and deterministic, allowing interested readers to easily and reliably reproduce our experiments.
So here we go, this is what we’ll work with, concretely:
case class ConnRecord(src_ip: String, src_port: Int, dst_ip: String, dst_port: Int)

def randomIPv4(r: Random) = Seq.fill(4)(r.nextInt(256)).mkString(".")
def randomPort(r: Random) = r.nextInt(65536)

def randomConnRecord(r: Random) = ConnRecord(
  src_ip = randomIPv4(r), src_port = randomPort(r),
  dst_ip = randomIPv4(r), dst_port = randomPort(r))

case class TestResult(numFilesScanned: Long, numRowsScanned: Long, numRowsReturned: Long)

def testFilter(table: String, filter: String): TestResult = {
  val query = s"SELECT COUNT(*) FROM $table WHERE $filter"
  val (result, metrics) = collectWithScanMetrics(sql(query).as[Long])
  TestResult(
    numFilesScanned = metrics("filesNum"),
    numRowsScanned = metrics.get("numOutputRows").getOrElse(0L),
    numRowsReturned = result.head)
}
// Runs testFilter() on all given filters and returns the percent of rows skipped
// on average, as a proxy for Data Skipping effectiveness: 0 is bad, 1 is good
def skippingEffectiveness(table: String, filters: Seq[String]): Double = { ... }
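The body of skippingEffectiveness is elided in the post; purely as an assumption about how such a helper could be written (not the authors’ code), one option is to average the fraction of rows skipped across the given filters:

```scala
// Hypothetical sketch, not the original implementation: for each filter,
// compute the fraction of rows NOT scanned, then average over all filters.
def skippingEffectivenessSketch(table: String, filters: Seq[String]): Double = {
  val totalRows = sql(s"SELECT COUNT(*) FROM $table").as[Long].head
  val skipped = filters.map { f =>
    1.0 - testFilter(table, f).numRowsScanned.toDouble / totalRows
  }
  skipped.sum / skipped.size
}
```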
Here’s how a randomly generated table of 100 files, 1K random records each, might look:
SELECT row_number() OVER (ORDER BY file) AS file_id,
       count(*) AS numRecords,
       min(src_ip), max(src_ip), min(src_port), max(src_port),
       min(dst_ip), max(dst_ip), min(dst_port), max(dst_port)
FROM (
  SELECT input_file_name() AS file, * FROM conn_random)
GROUP BY file

Seeing how every file’s min-max ranges cover almost the entire domain of values, it is easy to predict that there will be very little opportunity for file skipping. Our evaluation function confirms that:
skippingEffectiveness(connRandom, singleColumnFilters)

Ok, that’s expected, as our data is randomly generated and so there are no correlations. So let’s try explicitly sorting data before writing it.
spark.read.table(connRandom)
.repartitionByRange($"src_ip", $"src_port", $"dst_ip", $"dst_port")
// or just .sort($"src_ip", $"src_port", $"dst_ip", $"dst_port")
.write.format("delta").saveAsTable(connSorted)
skippingEffectiveness(connSorted, singleColumnFilters)
Hmm, we have indeed improved our metric, but 25% is still not great. Let’s take a closer look:
val src_ip_eff = skippingEffectiveness(connSorted, srcIPv4Filters)
val src_port_eff = skippingEffectiveness(connSorted, srcPortFilters)
val dst_ip_eff = skippingEffectiveness(connSorted, dstIPv4Filters)
val dst_port_eff = skippingEffectiveness(connSorted, dstPortFilters)

Turns out src_ip lookups are really fast but all others are basically just full table scans. Again, that’s no surprise. As explained earlier, that’s what you get with linear sorting: the resulting data is clustered perfectly along the first dimension (src_ip in our case), but almost not at all along further dimensions.
So how can we do better? By enforcing ZORDER clustering.
spark.read.table(connRandom)
.write.format("delta").saveAsTable(connZorder)
sql(s"OPTIMIZE $connZorder ZORDER BY (src_ip, src_port, dst_ip, dst_port)")
skippingEffectiveness(connZorder, singleColumnFilters)
Quite a bit better than the 0.25 obtained by linear sorting, right? Also, here’s the breakdown:
val src_ip_eff = skippingEffectiveness(connZorder, srcIPv4Filters)
val src_port_eff = skippingEffectiveness(connZorder, srcPortFilters)
val dst_ip_eff = skippingEffectiveness(connZorder, dstIPv4Filters)
val dst_port_eff = skippingEffectiveness(connZorder, dstPortFilters)

A couple of observations worth noting:
- It is expected that skipping effectiveness on src_ip is now lower than with linear ordering, as the latter would ensure perfect clustering, unlike z-ordering. However, the other columns’ score is now almost just as good, unlike before when it was 0.
- It is also expected that the more columns you z-order by, the lower the effectiveness.
For example, ZORDER BY (src_ip, dst_ip) achieves 0.82. So it is up to you to decide what filters you care about the most.
In the real-world use case presented at the Spark + AI Summit, the skipping effectiveness on a typical WHERE src_ip = x AND dst_ip = y query was even higher. In a data set of 504 terabytes (over 11 trillion rows), only 36.5 terabytes needed to be scanned thanks to data skipping. That’s a significant reduction of 92.4% in the number of bytes and 93.2% in the number of rows.
Conclusion
Using Databricks Delta’s built-in data skipping and ZORDER clustering features, large cloud data lakes can be queried in a matter of seconds by skipping files not relevant to the query. In a real-world cybersecurity analysis use case, 93.2% of the records in a 504-terabyte dataset were skipped for a typical query, reducing query times by up to two orders of magnitude.
In other words, Databricks Delta can speed up your queries by as much as 100X.
Note: Data skipping has been offered as an independent option outside of Databricks Delta in the past as a separate preview. That option will be deprecated in the near future. We highly recommend you move to Databricks Delta to take advantage of the data skipping capability.
Read More
Here are some assets for you:
- Databricks Delta Product Page
- Databricks Delta User Guide AWS or Azure
- Databricks Engineering Blog Post on Databricks Delta
1. To be clear, here we mean write.partitionBy(), not to be confused with RDD partitions.
2. To get an idea of what that looks like, check out the sample Bro data that’s kindly hosted by www.secrepo.com.
--
Try Databricks for free. Get started today.
The post Processing Petabytes of Data in Seconds with Databricks Delta appeared first on Databricks.
How to Add a Background Color, Picture, or Texture to a Word Document
Five Android Features Samsung Does Better Than Google
How to Disable the Articles on Chrome’s New Tab Page for Android and iPhone
Looking towards the Future with Business Analytics
How Water Damages Electronics
Introducing Jenkins Cloud Native SIG
On large-scale Jenkins instances, master disk and network I/O become bottlenecks in particular cases. Build logging and artifact storage were among the most intensive I/O consumers, hence it would be great to somehow redirect them to external storage. Back in 2016 there were active discussions about such Pluggable Storage for Jenkins. At that point we created several prototypes, but then other work took precedence. There was still high demand for Pluggable Storage on large-scale instances, and these stories also became a major obstacle for cloud native Jenkins setups.
I am happy to say that the Pluggable Storage discussions are back online. You may have seen changes in the Core for Artifact Storage (JEP-202) and a new Artifact Manager for S3 plugin. We have also created a number of JEPs for External Logging and created a new Cloud Native Special Interest Group (SIG) to offer a venue for discussing changes and to keep them as open as possible.
Tomorrow, Jesse Glick and I will be presenting the current External Logging designs at the Cloud Native SIG online meeting; you can find more info about the meeting here. I decided that this is a good time to write about the new SIG. In this blog post I will try to provide my vision of the SIG and its purpose. I will also summarize the current status of the activities in the group.
What are Special Interest Groups?
If you follow the developer mailing list, you may have seen the discussion about introducing SIGs in the Jenkins project. The SIG model has been proposed by R. Tyler Croy, and it largely follows the successful Kubernetes SIG model. The objective of these SIGs is to make the community more transparent to contributors and to offer venues for specific discussions. The idea of SIGs and how to create them is documented in JEP-4. JEP-4 is still in Draft state, but a few SIGs have been already created using that process: Platform SIG, GSoC SIG and, finally, Cloud Native SIG.
SIGs are a big opportunity for the Jenkins project, offering a new way to onboard contributors who are interested only in particular aspects of Jenkins. With SIGs they can subscribe to particular topics without following the entire developer mailing list, which can get pretty busy nowadays. It also offers company contributors a clear way to join the community and participate in specific areas. This is great for larger projects which cannot be done by a single contributor. Like JEPs, SIGs help focus and coordinate efforts.
And, back to major efforts… Lack of resources among core contributors was one of the reasons why we did not deliver on Pluggable Storage stories back in 2016. I believe that SIGs can help fix that in Jenkins, making it easier to find groups with the same interests and reach out to them in order to organize activity. Regular meetings are also helpful to get such efforts moving.
The points above are the main reasons why I joined the Cloud Native SIG. Similarly, that’s why I decided to create a Platform SIG to deliver on major efforts like Java 10+ support in Jenkins. I hope that more SIGs get created soon so that contributors can focus on the areas of their interest.
Cloud Native SIG
In the original proposal, Carlos Sanchez, the Cloud Native SIG chair, described the purpose of the SIG well. There has been great progress this year in cloud-native-minded projects like Jenkins X and Jenkins Essentials, but the current Jenkins architecture does not offer particular features which could be utilized there: Pluggable Storage, High Availability, etc. There are ways to achieve these using Jenkins plugins and some infrastructure tweaks, but it is far from an out-of-the-box experience. It complicates Jenkins management and slows down development of new cloud-native solutions for Jenkins.
So, what do I expect from the SIG?
-
Define a roadmap towards a Cloud-Native Jenkins architecture which will help the project stay relevant for Cloud Native installations
-
Provide a venue for discussion of critical Jenkins architecture changes
-
Act as a steering committee for Jenkins Enhancement Proposals in the area of Cloud-Native solutions
-
Finally, coordinate efforts between contributors and get new contributors onboard
What’s next in the SIG?
The SIG agenda is largely defined by the SIG participants. If you are interested in discussing particular topics, just propose them on the SIG mailing list. As the current SIG page describes, there are several areas defined as initial topics: Artifact Storage, Log Storage, and Configuration Storage.
All these topics are related to the Pluggable Storage area, and the end goal for them is to ensure that all data is externalized so that replication becomes possible. In addition to the data types mentioned above, discussed at the Jenkins World 2016 summit, we will need to externalize other data types: Item and Run storage, fingerprints, test and coverage results, etc. There is some foundation work being done for that. For example, Shenyu Zheng is working on a Code Coverage API plugin which would help unify the code coverage storage formats in Jenkins.
Once the Pluggable Storage stories are done, the next steps are true High Availability, rolling or canary upgrades, and zero downtime. At that point, other foundational stories like Remoting over Kafka by Pham Vu Tuan might be integrated into the Cloud Native architecture to make Jenkins more robust against outages within the cluster. It will take some time to get to this state, but it can be done incrementally.
Let me briefly summarize the current state of the three focus areas listed for the Cloud Native SIG.
Artifact Storage
There are many existing plugins that allow uploading and downloading artifacts from external storage (e.g. S3, Artifactory, Publish over SFTP, etc.), but there are no plugins which can do it transparently, without using new steps. In many cases the artifacts also get uploaded through the master, which increases load on the system. It would be great if there were a layer which allowed storing artifacts externally when using common steps like Archive Artifacts.
Artifact storage work was started this spring by Jesse Glick, Carlos Sanchez and Ivan Fernandez Calvo before the Cloud Native SIG was actually founded. Current state:
-
JEP-202 "External Artifact Storage" has been proposed in the Jenkins community. This JEP defines API changes in the Jenkins core which are needed to support External artifact managers
-
Jenkins Pipeline has been updated to support external artifact storage for the archive/unarchive and stash/unstash steps
-
A new Artifact Manager for S3 plugin provides a reference implementation of the new API. The plugin is available in the main Jenkins update centers
-
A number of plugins have been updated in order to support external artifact storage
The Artifact Manager API is available in Jenkins LTS starting from 2.121.1, so it is possible to create new implementations using the provided API and existing implementations. This new feature is fully backward compatible with the default Filesystem-based storage, but there are known issues for plugins explicitly relying on artifact locations in JENKINS_HOME (you can find a list of such plugins here). It will take a while to get all plugins supported, but the new API in the core should allow migrating plugins.
I hope we will revisit the External Artifact Storage at the SIG meetings at some point. It would be a good opportunity to do a retrospective and to understand how to improve the process in SIG.
Log storage
Log storage is a separate big story. Back in 2016 External logging was one of the key Pluggable Storage stories we defined at the contributor summit. We created an EPIC for the story (JENKINS-38313) and after that created a number of prototypes together with Xing Yan and Jesse Glick. One of these prototypes for Pipeline has recently been updated and published here.
Jesse Glick and Carlos Sanchez are returning to this story and plan to discuss it within the Cloud Native SIG. There are a number of Jenkins Enhancement proposals which have been submitted recently:
In the linked documents you can find references to current reference implementations. So far we have a working prototype for the new design. There are still many bits to fix before the final release, but the designs are ready for review and feedback.
This Tuesday (Jul 31) we are going to have a SIG meeting in order to present the current state and to discuss the proposed designs and JEPs. The meeting will happen at 3PM UTC. You can watch the broadcast using this link. Participant link will be posted in the SIGs Gitter channel 10 minutes before the meeting.
Configuration storage
This is one of the future stories we would like to consider. Although configurations are not big, externalizing them is a critical task for getting highly-available or disposable Jenkins masters. There are many ways to store configurations in Jenkins, but 95% of cases are covered by the XmlFile layer which serializes objects to disk and reads them using the XStream library. Externalizing these XmlFiles would be a great step forward.
There are several prototypes for externalizing configurations, e.g. in DotCI. There are also other implementations which could be upstreamed to the Jenkins core:
-
Alex Nordlund has recently proposed a pull request to Jenkins Core, which should make the XML Storage pluggable
-
James Strachan has implemented a similar engine for Kubernetes in the kubeify prototype
-
I also did some experiments with externalizing XML Storages back in 2016
The next steps for this story would be to aggregate implementations into a single JEP. I have it in my queue, and I hope to write up a design once we get more clarity on the External logging stories.
Conclusions
Special Interest Groups are a new format for collaboration and discussion in the Jenkins community. Although we have had some working groups before (Infrastructure, Configuration-as-Code, etc.), the introduction of SIGs sets a new bar in terms of project transparency and consistency. Major architecture changes in Jenkins are needed to ensure its future in new environments, and SIGs will help boost visibility and participation around these changes.
If you want to know more about the Cloud Native SIG, all resources are listed on the SIG’s page on jenkins.io. If you want to participate in the SIG’s activities, just do the following:
-
Subscribe to the mailing list
-
Join our Gitter channel
-
Join our public meetings
I am also working on organizing a face-to-face Cloud Native SIG meeting at the Jenkins Contributor Summit, which will happen on September 17 during DevOps World | Jenkins World in San Francisco. If you come to DevOps World | Jenkins World, please feel free to join us at the contributor summit or to meet us at the community booth. Together with Jesse and Carlos, I am also going to present some bits of our work in the "A Cloud Native Jenkins" talk.
Stay tuned for more updates and demos on the Cloud-Native Jenkins fronts!
Anker Soundcore Space NC Headphones Review: An Ideal Budget Pick

Premium noise cancelling headphones are pretty expensive, but the Anker Soundcore Space NC sets out to prove they don’t have to be.
Monday, 30 July 2018
Geek Trivia: Which One Of These Common Food Crops Is Deadly If Stored Incorrectly?
YouTube Surrenders, Removes Black Bars From Vertical Videos
New Ecobee Feature Will Adjust Your Thermostat When Energy Rates Get Too High

Ecobee is rolling out a new feature to some users that will automatically tweak your thermostat when energy rates get too high.
The new…
The Best Compact Mobile Keyboards For Typing On The Go

So you want to start chipping away at that screenplay in your local coffee shop, but lugging your laptop along isn’t ideal.
How to Recover Your Forgotten WhatsApp PIN
Will MoviePass Please Die So People Stop Complaining About It
How to Roll Back to iOS 11 (If You’re Using the iOS 12 Beta)
How the Digital Ecosystem Will Revolutionize Business
The digital ecosystem is changing everything. Companies that don't adapt to this reality risk missing out on an exciting new way of doing business.
The post How the Digital Ecosystem Will Revolutionize Business appeared first on Hortonworks.
The Best Video Editing Tools for Chromebooks
How India's software skills are helping its hardware startups to rise again
How to Change a Picture to Black and White in Microsoft Word
How to Enable Windows Defender’s Secret Crapware Blocker
Sunday, 29 July 2018
Geek Trivia: Which Of These Used To Be An Official Olympic Sport?
The Best Manual Coffee Grinders For Delicious And Consistent Flavor

Fresh ground coffee tastes better, but really high quality grinders are expensive. What’s a coffee aficionado on a budget to do?
Cord Cutting Isn’t Just About Money: Streaming Services Are Better Than Cable
Saturday, 28 July 2018
Geek Trivia: How Fast Was The CPU In The Original IBM PC?
The Best Running Watch For Every Budget

If you’re running regularly, it’s useful to be able to track your progress, pace, elevation, and route.
Forget Voice Control, Automation Is the Real Smarthome Superpower
Facial Recognition Software Could Help Women Find Egg Donors Who Look Like Them
What Is a Log File (and How Do I Open One)?
Friday, 27 July 2018
MoviePass Had an Outage Because It Ran Out Money, So Maybe Use Yours Quick
What the Hell Does Valve Even Do Anymore (Besides Take Our Money)
The Best Automatic Dog Food Dispensers

Automated dog food dispensers won’t just make your life easier, they can also improve yo…
How to Type Accent Marks Over Letters in Microsoft Word
You Can Now Schedule Custom Routines in Google Home
The Best Photo Editors for Chromebooks
The Best Laptop Bags Under $40

So you just bought that dream laptop. Super-powerful. Amazingly thin.
What’s the Difference Between an Optical and Electronic Viewfinder?
How to Enable Dark Mode for Gmail
A note to India Inc on thinking through its Blockchain strategy
Bloom Energy raises $270M in NYSE listing
Qualcomm drops $44 billion NXP bid after failing to win China approval
How To Recover Your Forgotten Snapchat Password
The Best PC Gaming Headsets For Every Budget

If you want to feel immersed in your PC games and effectively communicate in online multiplaye…
Thursday, 26 July 2018
Geek Trivia: Besides Humans, The Only Mammal Known To Love Hot Peppers Is The?
Give All Your Smart Home Gadgets Unique Names, Even Across Different Services

Most smart home gadgets like Hue or Nest will make you use unique names within their service.
rquery: Practical Big Data Transforms for R-Spark Users
This is a guest community blog from Nina Zumel and John Mount, data scientists and consultants at Win-Vector. They share how to use rquery with Apache Spark on Databricks.
Try this notebook in Databricks
Introduction
In this blog, we will introduce rquery, a powerful query tool that allows R users to implement data transformations using Apache Spark on Databricks. rquery is based on Edgar F. Codd’s relational algebra, informed by our experiences using SQL and R packages such as dplyr at big data scale.
Data Transformation and Codd’s Relational Algebra
rquery is based on an appreciation of Codd’s relational algebra. Codd’s relational algebra is a formal algebra that describes the semantics of data transformations and queries. Earlier hierarchical databases required associations to be represented as functions or maps. Codd relaxed this requirement from functions to relations, allowing tables that represent more powerful associations (allowing, for instance, two-way multimaps).
Codd’s work allows most significant data transformations to be decomposed into sequences made up from a smaller set of fundamental operations:
- select (row selection)
- project (column selection/aggregation)
- Cartesian product (table joins, row binding, and set difference)
- extend (derived columns; the keyword comes from Tutorial-D).
One of the earliest and still most common implementations of Codd’s algebra is SQL. Formally, Codd’s algebra assumes that all rows in a table are unique; SQL further relaxes this restriction to allow multisets.
rquery is another realization of the Codd algebra that implements the above operators, plus some higher-order operators, and emphasizes a left-to-right pipe notation. This gives the Spark user an additional way to work effectively.
SQL vs pipelines for data transformation
Without a pipeline-based operator notation, the common ways to control Spark include SQL or sequencing SparkR data transforms. rquery is a complementary approach that can be combined with these other methodologies.
One issue with SQL, especially for the novice SQL programmer, is that it can be somewhat unintuitive.
- SQL expresses data transformations as nested function composition
- SQL uses some relational concepts as steps, others as modifiers and predicates.
For example, suppose you have a table of information about irises, and you want to find the species with the widest petal on average. In R the steps would be as follows:
- Group the table into Species
- Calculate the mean petal width for each Species
- Find the widest mean petal width
- Return the appropriate species
We can do this in R using rqdatatable, an in-memory implementation of rquery:
library(rqdatatable)
## Loading required package: rquery
data(iris)
iris %.>%
project_nse(., groupby=c('Species'),
mean_petal_width = mean(Petal.Width)) %.>%
pick_top_k(.,
k = 1,
orderby = c('mean_petal_width', 'Species'),
reverse = c('mean_petal_width')) %.>%
select_columns(., 'Species')
## Species
## 1: virginica
Of course, we could also do the same operation using dplyr, another R package with Codd-style operators. rquery has some advantages we will discuss later. In rquery, the original table (iris) is at the beginning of the query, with successive operations applied to the results of the preceding line. To perform the equivalent operation in SQL, you must write down the operation essentially backwards:
SELECT
Species
FROM (
SELECT
Species,
mean('Petal.Width') AS mean_petal_width
FROM
iris
GROUP BY Species ) tmp1
WHERE mean_petal_width = max(mean_petal_width) /* try to get widest species */
ORDER BY Species /* To make tiebreaking deterministic */
LIMIT 1 /* Get only one species back (in case of ties) */
In SQL the original table is in the last or inner-most SELECT statement, with successive results nested up from there. In addition, column selection directives are at the beginning of a SELECT statement, while row selection criteria (WHERE, LIMIT) and modifiers (GROUP BY, ORDER BY) are at the end of the statement, with the table in between. So the data transformation goes from the inside of the query to the outside, which can be hard to read — not to mention hard to write.
rquery represents an attempt to make data transformation in a relational database more intuitive by expressing data transformations as a sequential operator pipeline instead of nested queries or functions.
rquery for Spark/R developers
For developers working with Spark and R, rquery offers a number of advantages. First, R developers can run analyses and perform data transformations in Spark using an easier to read (and to write) sequential pipeline notation instead of nested SQL queries. As we mentioned above, dplyr also supplies this capability, but dplyr is not compatible with SparkR — only with sparklyr. rquery is compatible with both SparkR and sparklyr, as well as with Postgres and other large data stores. In addition, dplyr’s lazy evaluation can complicate the running and debugging of large, complex queries (more on this below).
The design of rquery is database-first, meaning it was developed specifically to address issues that arise when working with big data in remote data stores via R. rquery maintains complete separation between the query specification and query execution phases, which allows useful error-checking and some optimization before the query is run. This can be valuable when running complex queries on large volumes of data; you don’t want to run a long query only to discover that there was an obvious error on the last step.
rquery checks column names at query specification time to ensure that they are available for use. It also keeps track of which columns from a table are involved with a given query, and proactively issues the appropriate SELECT statements to narrow the tables being manipulated.
This may not seem important on Spark, due to its columnar orientation and lazy evaluation semantics, but it can be key on other data stores, is critical on Spark if you have to cache intermediate results for any reason (such as attempting to break calculation lineage), and is useful when working with traditional row-oriented systems. The effect also shows up on Spark once we work at scale. This can help speed up queries that involve excessively wide tables where only a few columns are needed. rquery also offers well-formatted textual as well as graphical presentations of query plans. In addition, you can inspect the generated SQL query before execution.
Example
For our next example let’s imagine that we run a food delivery business, and we are interested in what types of cuisines (‘Mexican’, ‘Chinese’, etc) our customers prefer. We want to sum up the number of orders of each cuisine type (or restaurant_type) by customer and compute which cuisine appears to be their favorite, based on what they order the most. We also want to see how strong that preference is, based on what fraction of their orders is of their favorite cuisine.
We’ll start with a table of orders, which records order id, customer id, and restaurant type.
| Cust ID | restaurant_type | order_id |
|---|---|---|
| cust_1 | Indian | 1 |
| cust_1 | Mexican | 2 |
| cust_8 | Indian | 3 |
| cust_5 | American | 4 |
| cust_9 | Mexican | 5 |
| cust_9 | Indian | 6 |
To work with the data using rquery, we need an rquery handle to the Spark cluster. Since rquery interfaces with many different types of SQL-dialect data stores, it needs an adapter to translate rquery functions into the appropriate SQL dialect. The default handler assumes a DBI-adapted database. Since SparkR is not DBI-adapted, we must define the handler explicitly, using the function rquery::rquery_db_info(). The code for the adapter is here. Let’s assume that we have created the handler as db_hdl.
library("rquery")
print(db_hdl) # rquery handle into Spark
## [1] "rquery_db_info(is_dbi=FALSE, SparkR, <environment: 0x7fc2710cce40>)"
Let’s assume that we already have the data in Spark, as order_table. To work with the table in rquery, we must generate a table description, using the function db_td(). A table description is a record of the table’s name and columns; db_td() queries the database to get the description.
# assume we have data in spark_handle,
# make available as a view
SparkR::createOrReplaceTempView(spark_handle, "order_table")
# inspect the view for column names
table_description = db_td(db_hdl, "order_table")
print(table_description)
## [1] "table('`order_table`'; custID, restaurant_type, orderID)"
print(column_names(table_description))
## [1] "custID" "restaurant_type" "orderID"
Now we can compose the necessary processing pipeline (or operator tree), using rquery’s Codd-style steps and the pipe notation:
rquery_pipeline <- table_description %.>%
extend_nse(., one = 1) %.>% # a column to help count
project_nse(., groupby=c("custID", "restaurant_type"),
total_orders = sum(one)) %.>% # sum orders by type, per customer
normalize_cols(., # normalize the total_order counts
"total_orders",
partitionby = 'custID') %.>%
rename_columns(., # rename the column
c('fraction_of_orders' = 'total_orders')) %.>%
pick_top_k(., # get the most frequent cuisine type
k = 1,
partitionby = 'custID',
orderby = c('fraction_of_orders', 'restaurant_type'),
reverse = c('fraction_of_orders')) %.>%
rename_columns(., c('favorite_cuisine' = 'restaurant_type')) %.>%
select_columns(., c('custID',
'favorite_cuisine',
'fraction_of_orders')) %.>%
orderby(., cols = 'custID')
Before executing the pipeline, you can inspect it, either as text or as an operator diagram (using the package DiagrammeR). This is especially useful for complex queries that involve multiple tables.
rquery_pipeline %.>%
op_diagram(.) %.>%
DiagrammeR::DiagrammeR(diagram = ., type = "grViz")

Notice that the normalize_cols and pick_top_k steps were decomposed into more basic Codd operators (for example, the extend and select_rows nodes).
We can also look at Spark’s query plan through the Databricks user interface.

You can also inspect what tables are used in the pipeline, and which columns in those tables are involved.
tables_used(rquery_pipeline)
## [1] "order_table"
columns_used(rquery_pipeline)
## $order_table
## [1] "custID" "restaurant_type"
If you want, you can inspect the (complex and heavily nested) SQL query that will be executed in the cluster. Notice that the column orderID, which is not involved in this query, is already eliminated in the initial SELECT (tsql_*_0000000000). Winnowing the initial tables down to only the columns used can be a big performance improvement when you are working with excessively wide tables, and using only a few columns.
cat(to_sql(rquery_pipeline, db_hdl))
## SELECT * FROM (
## SELECT
## `custID`,
## `favorite_cuisine`,
## `fraction_of_orders`
## FROM (
## SELECT
## `custID` AS `custID`,
## `fraction_of_orders` AS `fraction_of_orders`,
## `restaurant_type` AS `favorite_cuisine`
## FROM (
## SELECT * FROM (
## SELECT
## `custID`,
## `restaurant_type`,
## `fraction_of_orders`,
## row_number ( ) OVER ( PARTITION BY `custID` ORDER BY `fraction_of_orders` DESC, `restaurant_type` ) AS `row_number`
## FROM (
## SELECT
## `custID` AS `custID`,
## `restaurant_type` AS `restaurant_type`,
## `total_orders` AS `fraction_of_orders`
## FROM (
## SELECT
## `custID`,
## `restaurant_type`,
## `total_orders` / sum ( `total_orders` ) OVER ( PARTITION BY `custID` ) AS `total_orders`
## FROM (
## SELECT `custID`, `restaurant_type`, sum ( `one` ) AS `total_orders` FROM (
## SELECT
## `custID`,
## `restaurant_type`,
## 1 AS `one`
## FROM (
## SELECT
## `custID`,
## `restaurant_type`
## FROM
## `order_table`
## ) tsql_80822542854048008991_0000000000
## ) tsql_80822542854048008991_0000000001
## GROUP BY
## `custID`, `restaurant_type`
## ) tsql_80822542854048008991_0000000002
## ) tsql_80822542854048008991_0000000003
## ) tsql_80822542854048008991_0000000004
## ) tsql_80822542854048008991_0000000005
## WHERE `row_number` <= 1
## ) tsql_80822542854048008991_0000000006
## ) tsql_80822542854048008991_0000000007
## ) tsql_80822542854048008991_0000000008 ORDER BY `custID`
Imagine typing in the above SQL. Finally, we can execute the query in the cluster. Note that this same pipeline could also be executed using a sparklyr connection.
execute(db_hdl, rquery_pipeline) %.>%
knitr::kable(.)
| Cust ID | favorite_cuisine | fraction_of_orders |
|---|---|---|
| cust_1 | Italian | 0.3225806 |
| cust_2 | Italian | 0.3125000 |
| cust_3 | Indian | 0.2857143 |
| cust_4 | American | 0.2916667 |
| cust_5 | American | 0.2857143 |
| cust_6 | Italian | 0.2800000 |
| cust_7 | American | 0.2400000 |
| cust_8 | Indian | 0.2903226 |
| cust_9 | Chinese | 0.3000000 |
This feature can be quite useful for “rehearsals” of complex data manipulation/analysis processes, where you’d like to develop and debug the data process quickly, using smaller local versions of the relevant data tables before executing them on the remote system.
Conclusion
rquery is a powerful “database first” piped query generator. It includes a number of useful documentation, debugging, and optimization features. It makes working with big data much easier and works with many systems including SparkR, sparklyr, and PostgreSQL, meaning rquery does not usurp your design decisions or choice of platform.
What’s Next
rquery will be adapted to more big data systems allowing R users to work with the tools of their choice. rqdatatable will be extended to be a powerful in-memory system for rehearsing data operations in R.
You can try rquery on Databricks. Simply follow the links below.
Read More
This example is available as a Databricks Community notebook here and as a (static) RMarkdown workbook here.
--
Try Databricks for free. Get started today.
The post rquery: Practical Big Data Transforms for R-Spark Users appeared first on Databricks.
Your Windows PC Might Finally Learn to Stop Rebooting While You’re Using It
How to Enable Dark Mode in Microsoft Edge
Alexa Cast is Amazon’s Answer to Google Cast—But Just for Music
How to Add Imported and iCal Calendars to Google Home
The Best Streaming TV Services for People with Young Kids

The way we shop for things changes once kids come into the picture—it’s no longer just about what we need or want, but what’s…
How to Control Line and Paragraph Spacing in Microsoft Word
How to Find (Or Make) Free Ringtones
The Best Clear iPhone 8 Cases To Protect (But Showcase) Your Phone

The iPhone 8 is a great looking phone and, naturally, you want to show off its stylish looks—but you also want to keep it safe from scra…
Amazon and Google Are Taking Over the Smarthome Industry
Wednesday, 25 July 2018
Join us at the Jenkins Contributor Summit San Francisco, Monday 17 September 2018
The Jenkins Contributor Summit is where the current and future contributors of the Jenkins project get together. This summit will be on Monday, September 17th 2018 in San Francisco, just before DevOps World | Jenkins World. The summit brings together community members to learn, meet and help shape the future of Jenkins. In the Jenkins community we value all types and sizes of contributions and love to welcome new participants. Register here.
Topics
There are plenty of exciting developments happening in the Jenkins community. The summit will feature a 'State of the Project' update including updates from the Jenkins officers. We will also have updates on the 'Big 5' projects in active development:
- Jenkins Essentials
- Jenkins X
- Configuration as Code
- Jenkins Pipeline
- Cloud Native Jenkins Architecture
Plus we will feature a Google Summer of Code update, and more!
Agenda
The agenda is shaping up well and here is the outline so far.
-
9:00am Kickoff & Welcome with coffee/pastries
-
10:00am Project Updates
-
12:00pm Lunch
-
1:00pm BoF/Unconference
-
3:00pm Break
-
3:30pm Ignite Talks
-
5:00pm Wrap-up
-
6:00pm Contributor Dinner
The BoF (birds-of-a-feather) session will be an opportunity for in-depth discussions, hacking, or learning more about any of the Big 5. Bring your laptop, come prepared with questions and ideas, and be ready for some hacking too if you want. Join in, hear the latest, and get involved in any project during the BoF sessions. If you want to share anything, there will be an opportunity to do a 5-minute ignite talk at the end. Attending is free, and no DevOps World | Jenkins World ticket is needed, but please RSVP if you are going to attend to help us plan. See you there!
Geek Trivia: Which Of These Games Is Credited With Defining The PC Adventure Gaming/RPG Genre?
Free Download: Shockingly Complete Archive of Old Mac and iPhone Wallpapers
The Best Portable Battery Packs For Every Situation

Your smartphone is kind of like a supervillain: it’s super-intelligent and has fantastic capabilities, but it’s also insanely hungry for…
These Popular Browser Extensions Are Leaking Your Full Web History On Purpose
Bay Area Apache Spark Meetup Summary @ Databricks HQ
On July 19, we held our monthly Bay Area Spark Meetup (BASM) at Databricks HQ in San Francisco. At the Spark + AI Summit in June, we announced two open-source projects: Project Hydrogen and MLflow.
Partly to continue sharing the progress of these open-source projects with the community and partly to encourage community contributions, two leading engineers from the respective teams talked at this meetup, providing technical details, roadmaps, and how the community can get involved.
First, Xiangrui Meng presented Project Hydrogen: State-of-the-Art Deep Learning on Apache Spark, in which he elaborated on shortcomings in using Apache Spark with deep learning frameworks and how this endeavor resolves them, integrating these frameworks as first-class citizens and taking advantage of Spark’s distributed computing nature and fault-tolerance capabilities at scale.
Second, Aaron Davidson shared MLflow: Infrastructure for a Complete Machine Learning Life Cycle. He spoke about challenges in the ML lifecycle today and detailed how experimenting, tracking, deploying, and serving machine learning models can be achieved using MLflow’s modular components and APIs for an end-to-end machine learning model lifecycle.
View Slides to Project Hydrogen Presentation
View Slides to MLflow Presentation
You can peruse the slides and watch the video at your leisure. To those who helped and attended, thank you for participating and for your continued community support.
--
Try Databricks for free. Get started today.
The post Bay Area Apache Spark Meetup Summary @ Databricks HQ appeared first on Databricks.
How to Turn Off New Message Alerts in Microsoft Outlook 2016 or 365
How to Leave Windows 10’s S Mode
How to Use Footnotes and Endnotes in Microsoft Word
How to Delete Your YouTube Watch History (and Search History)
ET Startup Awards 2018 Top Innovator: Sigtuple has its finger on the pulse
How to Properly Wrap Charging Cables to Prevent Damaging Them
The Five Must-Have Tools and Accessories for Charcoal Grilling

Grilling with charcoal is enjoyable, cheap, and usually results in better-tasting food, but it’s a little trickier than propane.
MLflow v0.3.0 Released
Today, we’re excited to announce MLflow v0.3.0, which we released last week with some of the features requested by internal clients and open-source users. MLflow 0.3.0 is already available on PyPI and the docs have been updated. If you do pip install mlflow as described in the MLflow quickstart guide, you will get the latest release.

In this post, we’ll describe a couple of new features and enumerate other items and bug fixes filed as issues on the GitHub repository.
GCP-Backed Artifact Support
We’ve added support for storing artifacts in Google Cloud Storage, through the --default-artifact-root parameter to the mlflow server command. This makes it easy to run MLflow training jobs on multiple cloud instances and track results across them. The following example shows how to launch the tracking server with a GCS artifact store. You will also need to set up authentication as described in the documentation. This item closes issue #152.
mlflow server --default-artifact-root gs://my-mlflow-google-bucket/
Apache Spark MLlib Integration
As part of MLflow’s Model component, we have added Spark MLlib models as a model flavor.
This means that you can export Spark MLlib models as MLflow models. Exported models, when saved using MLlib’s native serialization, can be deployed and loaded as Spark MLlib models or as Python functions within MLflow. To save and load these models, use the mlflow.spark API. This addresses issue #72. For example, you can save a Spark MLlib model as shown in the code snippet below:
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.regression import LinearRegression
import mlflow.spark

tokenizer = Tokenizer(inputCol="review", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lasso = LinearRegression(labelCol="rating", elasticNetParam=1.0, maxIter=20)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lasso])
model = pipeline.fit(dataset)  # `dataset` is an existing Spark DataFrame of reviews and ratings
mlflow.spark.log_model(model, "spark-model")
Now we can access this MLlib persisted model in an MLflow application.
import mlflow.spark

model = mlflow.spark.load_model("spark-model")
df = model.transform(test_df)  # `test_df` is a Spark DataFrame with the same schema as the training data
Other Features and Bug Fixes
In addition to these features, other items, bugs and documentation fixes are included in this release. Some items worthy of note are:
- [SageMaker] Support for deleting and updating applications deployed via SageMaker (issue #145)
- [SageMaker] Pushing the MLflow SageMaker container now includes the MLflow version that it was published with (issue #124)
- [SageMaker] Simplify parameters to SageMaker deployment by providing sane defaults (issue #126)
The full list of changes and contributions from the community can be found in the CHANGELOG. We welcome more input on mlflow-users@googlegroups.com or by filing issues or submitting patches on GitHub. For real-time questions about MLflow, we’ve also recently created a Slack channel for MLflow.
Read More
For an overview of what we’re working on next, take a look at the roadmap slides in our presentation from last week’s Bay Area Apache Spark Meetup or watch the meetup presentation.
--
Try Databricks for free. Get started today.
The post MLflow v0.3.0 Released appeared first on Databricks.
Tuesday, 24 July 2018
Geek Trivia: The Peace Symbol Is A Stylized Representation Of?
The iPod is Dead, and So Is Listening to Music Without Distractions
MoviePass Is Burning Money to Build a Userbase, But That’s Hardly Unique
How To Search Through (And Delete) Your Old Tweets
How to Fix All Your Samsung Phone’s Annoyances
How to Create and Format a Text Box in Microsoft Word
What is the New “Block Suspicious Behaviors” Feature in Windows 10?
Jenkins Javadoc: Service Improvements
Jenkins infrastructure is continuously improving. The latest service to get some attention and major improvement is the Jenkins javadoc.
There were a number of issues affecting that service:
-
Irregular updates - Developers wouldn’t find the latest Java documentation because of an inadequate update frequency.
-
Broken HTTPS support - when users would go to the Javadoc site they would get an unsafe site warning and then an incorrect redirection.
-
Obsolete content - Javadoc was not cleaned up correctly and plenty of obsolete pages remained which confused end users.
As Jenkins services migrate to Azure infrastructure, the javadoc service needed to be moved there as a standalone service. I took the same approach as for jenkins.io: putting data on Azure File Storage, using an nginx proxy in front of it, and running on Kubernetes. This approach brings multiple benefits:
-
We store static files on Azure File Storage, which brings data reliability, redundancy, etc.
-
We use a Kubernetes Ingress to configure the HTTP/HTTPS endpoint
-
We use a Kubernetes Service to provide load balancing
-
We use a Kubernetes Deployment to deploy stock nginx containers with an Azure File Storage volume
+----------------------+ goes on +------------------------------+
| Jenkins Developer |---------------->+ https://javadoc.jenkins.io |
+----------------------+ +------------------------------+
|
+-------------------------------------------------------------------|---------+
| Kubernetes Cluster: | |
| | |
| +---------------------+ +-------------------+ +-----------v------+ |
| | Deployment: Javadoc | | Service: javadoc <-----| Ingress: javadoc | |
+ +---------------------+ +-------------------+ +------------------+ |
| | |
| -----------------+ |
| | | |
| | | |
| +------------------------v--+ +--------v------------------+ |
| | Pod: javadoc | | Pod: javadoc | |
| | container: "nginx:alpine" | | container: "nginx:alpine" | |
| | +-----------------------+ | | +-----------------------+ | |
| | | Volume: | | | | Volume: | | |
| | | /usr/share/nginx/html | | | | /usr/share/nginx/html | | |
| | +-------------------+---+ | | +----+------------------+ | |
| +---------------------|-----+ +------|--------------------+ |
| | | |
+-----------------------|-----------------|-----------------------------------+
| |
| |
+--+-----------------+-----------+
| Azure File Storage: javadoc |
+--------------------------------+
The javadoc static files are now generated regularly by a Jenkins job and then published from a trusted Jenkins instance. We only update what has changed and remove obsolete documents. More information can be found here
The next step in this continuous improvement is to look at the user experience of the javadoc site, to make it easier to discover javadoc for other components or versions. (Help needed!)
These changes all go towards improving the developer experience for those using javadocs and making life easier for core and plugin developers. See the new and improved javadoc service here: Jenkins Javadoc.
Here’s How Gmail, Calendar, And Other Google Apps Might Look Soon
How to Limit Notifications on Your Nest Cam Using Activity Zones
The Best Toiletries and Gear to Put In Your Dopp Kit

A Dopp kit—named for American leatherworker Charles Doppelt who popularized the design—is an essential bit of travel gear.
Monday, 23 July 2018
The Best Way to Get a Phone Number for Your Small Business
Geek Trivia: The First Commercially Viable Strain Of Penicillin Was Created From Mold Found On A?
Netgear’s Arlo Adds a Smart, Audio-Only Doorbell to Its Home Security Product Line

Netgear has a line of security cameras under the Arlo brand, and now the company is expanding with a…
Buy It For Life: Our Office Gear That’s Stood the Test of Time

While the up front cost can be high, buying gadgets for life can save you a lot of hassle over the long run.
Your HomePod Could Soon Support Phone Calls
A Guide to AI, Machine Learning, and Deep Learning Talks at Spark + AI Summit Europe
Within a couple of years of its release as an open-source machine learning and deep learning framework, TensorFlow has seen an amazing rate of adoption. Consider the number of stars on its github page: over 105K; look at the number of contributors: 1500+; and observe its growing penetration and pervasiveness in verticals: from medical imaging to gaming; from computer vision to voice recognition and natural language processing.

As at the Spark + AI Summit in San Francisco, so too at Spark + AI Summit Europe, we have seen high-caliber technical talks about the use of TensorFlow and other deep learning frameworks as part of the new tracks: AI, Productionizing ML, and Deep Learning Techniques. In this blog, we highlight a few talks that caught our eye, in their promise and potential. It always helps to have some navigational guidance if you are new to the summit or the technology.
For example, look how Logical Clocks AB is using Apache Spark and TensorFlow to herald novel methods for developing distributed learning and training. Jim Dowling in his talk, Distributed Deep Learning with Apache Spark and TensorFlow, will explore myriad ways Apache Spark is combined with deep learning frameworks such as TensorFlow, TensorFlowonSpark, Horovod, and Deep Learning Pipelines to build deep learning applications.
Closely related to the above talk in integrating popular deep learning frameworks, such as TensorFlow, Keras or PyTorch, as first-class citizens on Apache Spark is Project Hydrogen: Unifying State-of-the-Art AI and Big Data in Apache Spark. Messrs Tim Hunter and Xiangrui Meng will share how to unify data and AI in order to simplify building production-ready AI applications. Continuing on how Spark integrates well with deep learning frameworks, Messrs Tim Hunter and Debajyoti Roy will discuss how to accomplish Geospatial Analytics with Apache Spark and Deep Learning Pipelines.
Now, if you are a news junkie and wonder how to decipher and discern its emotive content, you may marvel at the sophisticated algorithms and techniques behind it and how to apply them. In this fascinating use case, An AI Use Case: Market Event Impact Determination via Sentiment and Emotion Analysis, Messrs Lei Gao and Jin Wang, both from IBM, will reveal the technical mastery. Similarly, if you are curious how social media images, when analyzed and categorized employing AI, are engaging, say for marketing campaigns, this session from Jack McCush will prove equally absorbing: Predicting Social Engagement of Social Images with Deep Learning.
At the Spark + AI Summit in San Francisco, Jeffrey Yau’s talk on Time Series Forecasting Using Recurrent Neural Network and Vector Autoregressive Model: When and How was a huge hit. He will repeat it in London on how two specific techniques—Vector Autoregressive (VAR) Models and Recurrent Neural Network (RNN)—can be applied to financial models. Also, IBM’s Nick Pentreath will provide an overview of RNN, modeling time series and sequence data, common architectures, and optimizing techniques in his talk, Recurrent Neural Networks for Recommendations and Personalization.
Data and AI are all about scale—about productizing ML models, about managing data infrastructure that supports the ML code. Three talks seem to fit the bill that may be of interest to you. First is from Josef Habdan, an aviation use case aptly titled, RealTime Clustering Engine Using Structured Streaming and SparkML Running on Billions of Rows a Day. Second is from Messrs Gaurav Agarwal and Mani Parkhe, from Uber and Databricks, Intelligence Driven User Communications at Scale. And third deals with productionizing scalable and high volume Apache Spark pipelines for smart-homes with hundreds of thousands of users. Erni Durdevic of Quby in his talk will share, Lessons Learned Developing and Managing High Volume Apache Spark Pipelines in Production.
At Databricks we cherish our founders’ academic roots, so all previous summits have had research tracks. Research in technology heralds paradigm shifts—for instance, at UC Berkeley AMPLab, it led to Spark; at Google, it led to TensorFlow; at CERN it led to the world wide web. Nikolay Malitsky’s talk, Spark-MPI: Approaching the Fifth Paradigm, will address the existing impedance mismatch between data-intensive and compute-intensive ecosystems by presenting the Spark-MPI approach, based on the MPI Process Management Interface (PMI). And Intel researchers Qi Xie and Sophia Sun will share a case study: Accelerating Apache Spark with FPGAs: A Case Study for 10TB TPCx-HS Spark Benchmark Acceleration with FPGA.
And finally, if you’re new to TensorFlow, Keras or PyTorch and want to learn how it all fits in the grand scheme of data and AI, you can enroll in a training course offered on both AWS and Azure: Hands-on Deep Learning with Keras, Tensorflow, and Apache Spark. Or to get a cursory and curated Tale of Three Deep Learning Frameworks, attend Brooke Wenig’s and my talk.
What’s Next
You can also peruse and pick sessions from the schedule, too. In the next blog, we will share our picks from sessions related to Data Science, Developer and Deep Dives.
If you have not registered yet, take advantage of the early bird before July 27, 2018 and save £300. See you there!
Read More
Find out more about initial keynotes: Spark + AI Summit Europe Agenda Announced
--
Try Databricks for free. Get started today.
The post A Guide to AI, Machine Learning, and Deep Learning Talks at Spark + AI Summit Europe appeared first on Databricks.




