Data Science, Machine Learning, Natural Language Processing, Text Analysis, Recommendation Engine, R, Python
Wednesday, 31 January 2018
How to Add Phonetic Names to Contacts on the iPhone
What’s the Difference Between Microsoft Office for Windows and macOS?
Google Flights Will Now Tell You If Your Flight Is Likely to Be Delayed

The very last thing you want to find out after you rush through security at the airport is that your flight has been delayed.
How to Stream From VLC to Your Chromecast
What Are “Right to Repair” Laws, and What Do They Mean for You?
Mahindra & Mahindra takes tech companies' help to revamp its digital platform
How to Control Your Amazon Echo from Anywhere Using Your Phone
Geek Trivia: The First Handheld Game System To Use Interchangeable Cartridges Was The?
Tuesday, 30 January 2018
How Do YouTube Channels Make Money?
How to Take Screenshots of Your PC Games
How to Play Wii and GameCube Games on your PC with Dolphin
First-Class Support for Long Running Services on Apache Hadoop YARN
Introduction Apache Hadoop YARN is well known as the general resource-management platform for big-data applications such as MapReduce, Hive / Tez and Spark. It abstracts the complicated cluster resource management and scheduling from higher level applications and enables them to focus solely on their own application specific logic. In addition to big-data apps, another broad […]
The post First-Class Support for Long Running Services on Apache Hadoop YARN appeared first on Hortonworks.
Change Your Smartphone Case Like You Change Your Clothes

We wear different clothes for different activities and we use different tools for different tasks, but for some reason the majority of us use t…
How to Stream Your PC Gameplay With NVIDIA GeForce Experience
Are Third Party Camera Lenses Worth Buying?
Do You Realize How Much You Share Your Location?
How Difficult Is It to Replace an iPhone Battery?
Geek Trivia: Which Of These Fruits Contains A Powerful Natural Meat Tenderizer?
Monday, 29 January 2018
What Is the “commerce” Process, and Why Is It Running on My Mac?
Save 10% on Instacart by Opting Out of the Service Fee
Elon Musk’s $500 Flamethrower Probably Shouldn’t Be Used For Crème Brûlée

Elon Musk is perhaps best known as the guy who made those awesome electric cars, and slightly less well known as one of the guys who founded PayPal.
Don’t Bother with a Dedicated 4K Blu-ray Player, Buy an Xbox One Instead

If you’ve recently purchased a 4K TV, it’s natural that you want to enjoy stunning content on it.
How to Automatically Mute New Tabs in Chrome and Firefox
How to Add Phonetic Names to Contacts in Android (So Google Assistant Can Understand You)
Alibaba, Foxconn lead $350M funding in electric car startup
You are invited to the Post-FOSDEM 2018 Jenkins Hackfest
On the first weekend in February, numerous free and open source developers from around the world will travel to Brussels, Belgium, for arguably the largest event of its kind: FOSDEM. Among the thousands of hackers in attendance will be a number of Jenkins contributors.
On the Monday after FOSDEM, you are invited to join a group of those contributors for a full day of hacking on Jenkins. Folks of all experience levels are welcome; there will be sessions for everyone from seasoned hackers to new contributors.
All-day Jenkins Hackfest
The Hackfest will start at 9:30am on Monday with general introductions and a gathering of potential topics/projects. Bring suggestions for topics that interest you, or just come and choose from the topics others suggest; there will be plenty to choose from. All topic suggestions will be added to a backlog, and we will identify and cluster around the popular ones. Then we’ll divide into smaller groups and work on individual topics in timeboxed sessions.
Meals, snacks, and beverages will be provided throughout the day, wrapping up with dinner around 5pm.
Something for everyone
Hackfests like this one are a great opportunity for contributors of all levels to get involved, learn from each other, and work together on interesting, high-impact areas of the project.
Some long-time contributors already know what areas they’ll work on and are looking for people interested in joining them. Mark Waite (maintainer of the Jenkins Git and Git Client plugins) and Christian Halstrick (SAP) will be spending the day improving the way Git client plugin uses JGit. R. Tyler Croy (Jenkins community concierge) and Olivier Vernin (Jenkins infrastructure engineer) will work on infrastructure improvements.
Other contributors, such as Jesse Glick and Andrew Bayer (recipients of the "A Small Matter of Programming" award), will arrive without a set plan. They will, of course, have some topics to propose, so you might get a chance to work with them. Or if you have an area you’d like to work on, they and many other experts will be on hand for discussion and code review.
This is also a great opportunity for new contributors to join the project. Baptiste Mathus, long-time contributor and all-around nice guy, will host a "New Contributor Hackergarten" covering the basics of contributing to Jenkins and submitting fixes via GitHub pull requests. Even those with minimal coding experience can contribute by improving documentation and fixing typos via this same process.
Fun!
More than anything else, Hackfests like this are great fun. No matter what your level of experience, there will be plenty to do and great people with whom to do it. Reserve a space by joining the meetup here. Then bring your own laptop and passion for improving Jenkins.
Details
- Date: Monday, February 5, 2018
- Time: 9:30 AM to 5:00 PM
- Location: BeCentral sprl/bvba, Cantersteen 12, 1000 Brussels, Belgium (Room: Studio C, 1st floor)
Meals, snacks, and beverages will be provided. Bring your own computer.
Ask for "Jenkins" at the front desk if you get lost.
Indian govt testing LiFi tech to transmit very high speed Internet
Microsoft India project aims to help monitor driver behaviour
Reliance Jio turns focus towards IoT space in India
Geek Trivia: Leafcutter Ants Don’t Eat Leaves, But Use The Leaves To?
Sunday, 28 January 2018
How to Avoid Malware on Android
Geek Trivia: Ben Franklin Wasn’t Just A U.S. Founding Father, But The Founder Of Modern?
Saturday, 27 January 2018
The Best Apps and Tools for Chromebooks
How to Connect the Nest Secure to a New Wi-Fi Network
Geek Trivia: The Largest Sea Plane Ever Built Was The?
Friday, 26 January 2018
Women in Tech: Part 1 – Sr SQA Automation Engineer
My name is Yesha Vora. My early acquaintance with computer engineering happened when I was handed a book called “Learning C” by my school teacher. I pursued a Bachelor’s degree in computer engineering in my hometown in India and completed my Master’s in the United States. During my Master’s program at San Jose State […]
The post Women in Tech: Part 1 – Sr SQA Automation Engineer appeared first on Hortonworks.
Want Better Workplace Focus? Listen to a Video Game Soundtrack

Want a little productivity boost to get through your Friday (and every day after it, for that matter)?
How to Make LibreOffice Writer Templates
Roav VIVA Review: Kick Siri to the Curb and Make Alexa Your New Copilot

Alexa, thanks to the popularity of the Echo, might have a firm standing in millions of homes but outside the house Siri and Google Assistant sti…
How to Fix Android MMS Issues on Cricket Wireless
What’s the Difference Between the iPad, iPad Pro, and iPad Mini?
Cheap Windows Laptops Will Only Waste Your Time and Money
How to Fix Annoying Nest Secure Notifications
Geek Trivia: The First Cartoon Aired With Stereo Sound Was?
Thursday, 25 January 2018
What Is nsurlstoraged, and Why Is It Running on my Mac?
How to Read Comic Books and Manga on Your Kindle
How to Check Your Apple Pencil’s Battery Level
Introduction to Apache NiFi
Nearly ten years ago, I was presented with an amazing opportunity. I was fortunate enough to join a team of three incredibly talented engineers to build a new platform. It would be responsible for handling the ever-increasing volumes of data that would be streamed through my organization. The platform would have to allow users to […]
The post Introduction to Apache NiFi appeared first on Hortonworks.
The Best Portable Chargers For Every Need

If you’re lucky your phone can last a full day before you need to reach for a power cable.
How to Make Google Home Use Your Netflix Profile Based on Your Voice
What Are Samsung’s Micro LED TVs, and How Are They Different from OLED?
RBI warns banks about risks of Cryptocurrencies, wants higher scrutiny
With Moon race called off, ISRO's Antrix & TeamIndus terminate agreement
Accelerate Innovation with Microsoft Azure Databricks
It’s hard to believe that we are already three weeks into 2018. If you’re still struggling to get valuable insights from your data, now is the perfect time to try something new! We recently announced Azure Databricks, a fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure. With Azure Databricks, you can help your organization accelerate time to insights by providing a collaborative workspace and a high-performance analytics platform, all at the click of a button.
Spark on Azure enables data engineers and data scientists to increase performance and reduce costs by as much as 10-100x. How? Azure Databricks is the place to get Spark on Azure, optimized by the creators of Spark. It features optimized connectors to Azure storage platforms (e.g. Data Lake and Blob Storage) for the fastest possible data access and one-click startup directly from the Azure console, so you can get rolling faster. Notebooks on Databricks are live and shared, so that everyone in your organization can work with your data with real-time collaboration. It also features an integrated debugging environment to let you analyze the progress of your Spark jobs from within interactive notebooks. The bonus? Common analytics libraries, such as the Python and R data science stacks are preinstalled so that you can use them with Spark to derive insights.
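As a sketch of those optimized storage connectors in practice, a Databricks notebook typically points Spark at an Azure Blob Storage container with a couple of configuration lines. This is an illustrative configuration fragment only: the account name, container, secret scope, and path below are placeholders, and `spark` and `dbutils` are objects the Databricks platform provides.

```python
# Hypothetical configuration sketch for reading from Azure Blob Storage
# in a Databricks notebook. "myaccount", "mycontainer", the secret scope,
# and the path are placeholders, not real values.
spark.conf.set(
    "fs.azure.account.key.myaccount.blob.core.windows.net",
    dbutils.secrets.get(scope="storage", key="myaccount-key"),
)

# Read CSV data through the optimized WASB connector and run a query.
df = (spark.read
      .option("header", "true")
      .csv("wasbs://mycontainer@myaccount.blob.core.windows.net/events/"))
df.groupBy("country").count().show()
```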
We know that protecting your data and business is also critical with any analytics platform. With native Azure Active Directory integration, you get peace of mind that you are building on a secure, trusted cloud with fine-grained user permissions, enabling secure access to Databricks notebooks, clusters, jobs, and data.
Now is the perfect time to get started. Not sure how? Register for our January 25th webinar and we’ll walk you through the benefits of Spark on Azure, and how to get started with Azure Databricks.
Learn more about Azure Databricks today.
The post Accelerate Innovation with Microsoft Azure Databricks appeared first on Databricks.
Matei Zaharia’s 5 predictions about AI in 2018
Over the past few years, the demand for artificial intelligence (AI) and machine learning capabilities has surged with innovations in natural language processing, task automation, and predictions. From autonomous cars to a more personalized shopping experience, big data and artificial intelligence is at the forefront of new solutions that are delighting customers, improving business operations and enabling new products.
But AI is not as widespread as many would like. The reality is that there are only a handful of companies — technology giants such as Facebook, Google, and Uber — that have tapped into the promise of AI and are actually accomplishing their goals with it. Most companies continue to struggle to adopt AI for several reasons, including that AI problems in many businesses are often harder than AI problems on the web, and that suitable technology platforms and expertise are lacking. As AI and big data technologies continue to evolve, however, the barrier to entry for many companies is gradually falling. If we look past the hype around AI, which trends and advances are likely to make a material difference in 2018?
Databricks’ Chief Technologist, Matei Zaharia, will host a webinar later this month on five key predictions for AI in 2018. Dr. Zaharia started the Apache Spark project, cofounded Databricks, and is also a faculty member at Stanford DAWN, a research lab on usable machine learning, so he will provide a broad view on AI’s impact in business. Matei will discuss five predictions:
- Data will be the central competitive advantage. AI products can only be as good as the data we feed them, so we are going to see a massive influx in organizations curating datasets for a competitive edge. In addition, cloud platforms that enable users to pay by the dataset or query will create new business opportunities for companies to commercialize their proprietary datasets.
- AI will find new use cases, starting with verticals. Vertical-specific solutions and libraries will start to incorporate the newest ML techniques and transform existing business processes, as we are already starting to see, for example, with Apache Spark based processing tools in genomics.
- Data scientists will continue to grow in number. Data science and computer science are currently some of the most popular majors for college students. Although an influx of new graduates will not quench the demand for these roles, we will see more organizations starting data science teams and data products in areas with proven business value.
- Deep learning frameworks will start to converge and move up in abstraction. Although there are currently a large number of machine learning frameworks, many of them offer similar functionality, and with efforts like ONNX, the basic tasks of defining and serving models will be handled well. Instead, developers will start to focus on making these frameworks easier to use for higher level business applications.
- The cloud will enable new data application architectures. We are already starting to see that data management systems for the cloud can do much more than on-premise systems deployed by forklift onto virtual machines. We expect new offerings that take advantage of the scalability, elasticity, availability and ease of management of the cloud to simplify the pipeline around big data and AI products.
To dig deeper into Matei’s thoughts on these trends, join him in an upcoming webinar hosted by Data Science Central, as he discusses how centering organizations around high-quality data will be the main driver to AI, which new AI applications are seeing success in practice, and how new technologies including deep learning, data marketplaces and the cloud will affect the computing landscape.
--
Try Databricks for free. Get started today.
The post Matei Zaharia’s 5 predictions about AI in 2018 appeared first on Databricks.
How to Set Up the TP-Link Wi-Fi Smart Plug
Geek Trivia: The Name of Google’s Short-Lived “Wave” Collaboration Framework Was Inspired By Which Sci-Fi Show?
Wednesday, 24 January 2018
Google Home Won’t Mix Up Your Family’s Netflix Profiles Anymore

You can use Google Assistant to play movies and TV shows from Netflix on your Chromecast.
How to Add Sideheads and Pull Quotes to Microsoft Word Documents
HDFS Tiering with Isilon and ECS without tiering policies
Guest post by Boni Bruno. Originally published for the Dell EMC Community. Many organizations use traditional, direct attached storage (DAS) Hadoop clusters for storing big data. As data requirements grow, organizations are finding traditional Hadoop storage architecture inefficient, costly, and difficult to manage. With most Hadoop deployments, as more and more data is stored for […]
The post HDFS Tiering with Isilon and ECS without tiering policies appeared first on Hortonworks.
7 Gadgets Under $50 That’ll Improve Your Kitchen Experience

You can get by in the kitchen with very few tools, but what’s the fun in that?
What Is Apple’s W1 Chip?
What Is 5G, and How Fast Will It Be?
Google Lunar XPRIZE competition to end without a winner
World's first robotic citizen Sophia to help drive research in artificial general intelligence
How Big Data is Paving the Way for the Connected Car
The International Consumer Electronics Show took place last week in Las Vegas, Nevada. CES is an annual trade show that showcases the newest products and technologies in the consumer electronics space. This year there were over 184,000 total attendees and 4,000 exhibiting companies. One of the hottest topics at the show was the connected car, […]
The post How Big Data is Paving the Way for the Connected Car appeared first on Hortonworks.
Don’t Worry, Amazon Echo Alarms Still Work Without Internet
Udacity launches nanodegree on flying cars
Geek Trivia: In British Culture, An “Agony Aunt” Dispenses?
Tuesday, 23 January 2018
How to Find Your Android Device’s Info for Correct APK Downloads
Is Now a Good Time to Buy a New NVIDIA or AMD Graphics Card?
What Are the Photoshop Express, Fix, Mix, and Sketch Mobile Apps?
It’s Time to Stop Buying Phones from OnePlus
Machines are rising in India but your job is still secure. Here's why
No Apple Store Nearby? Try an Apple Authorized Service Provider
Geek Trivia: The Largest Recorded Family Tree In The World Belongs To The Descendants Of?
Monday, 22 January 2018
What Is Apple’s “Secure Enclave”, And How Does It Protect My iPhone or Mac?
How to Disable the Lock Sound on an iPhone or iPad
What’s the Difference Between Android One and Android Go?
How to See Posts From Your Favorite Facebook Pages More Often
How to Overclock Your Intel Processor and Speed Up Your PC
Decoding blockchain for the Government
Travel startup ixigo launches AR feature on its app for train passengers
Samsung's Alex Hawkinson on smart-home evolution
Intel taking Artificial Intelligence to Motorsports, Hollywood
Geek Trivia: The Crowdfunded Film That Reached Its Funding Goal The Fastest Was?
Sunday, 21 January 2018
Rooting Android Just Isn’t Worth It Anymore
Geek Trivia: Captain Kirk’s Uniform Top In Star Trek Appeared Gold On Screen, But Was Actually?
Saturday, 20 January 2018
How to Disable the Pathlight on a Nest Detect
Geek Trivia: Which Of These Weather Phenomena Are Known As “St. Elmo’s Fire”?
Friday, 19 January 2018
How to protect your executive from Phishing and Ransomware
How to Hide Your Activity Status on Instagram
The Best Podcast Apps For Your Smartphone

If you’re not enjoying the wealth of podcasts out there, you’re really missing out. Podcasts provide you with the experience of a ra…
How to Share Your Work Under a Creative Commons License
What Is WPA3, and When Will I Get It On My Wi-Fi?
Ditch Your PC’s Calculator App and Use a Real One Instead
Geek Trivia: The First Metal Group To Reach The Top Of The Billboard Charts Was?
Thursday, 18 January 2018
How to Enable Fast User Switching in macOS
How to Get Medical Marijuana Card Tips
A Startling Fact about California Medical Marijuana Card Uncovered
Practical Tips for How to Get Medical Marijuana Card You Can Use Starting Today
Low Profile Switches Are Coming to Shrink Your Mechanical Keyboards
Add Instant Backseat Entertainment to Your Car with a Tablet Mount
Whether you want to mount a tablet, a large phone, or even a Nintendo Switch in your car to keep your kids entertained on the road, these sturd…
How to See Your Most Used Apps on Android
Should You Repair Your Own Phone or Laptop?
Vital Pieces of Luxury Safari Lodges in South Africa
The Tried and True Method for Luxury Safari Lodges in South Africa in Step by Step Detail
New Series: Women @ Hortonworks
Happy New Year! We have been looking back at some of the great achievements of the past year and one thing which really stands out is the important contributions we have had from Women at Hortonworks. We have women engineers contributing to Hortonworks products in various different ways – be it in engineering, customer support, […]
The post New Series: Women @ Hortonworks appeared first on Hortonworks.
Jenkins World 2018: Call for Papers is Open
This is a guest post by Alyssa Tong, who runs the Jenkins Area Meetup program and is also responsible for Marketing & Community Programs at CloudBees, Inc.
Happy 2018! The Jenkins World train is ready to take off once again. As usual, the first sign of the festivities to come is the Call for Papers. Those who attended Jenkins World 2017 know that Jenkins World 2018 is returning to San Francisco. What they may not know is that Jenkins World will also be coming to Europe. You read that right: Jenkins World is taking place in two locations in 2018:
- Jenkins World USA | San Francisco | September 16 - 19, 2018
- Jenkins World Europe | October - date and location TBA
To encourage open collaboration and stimulate discussions that will help advance Jenkins adoption and drive it forward, we invite Jenkins users, developers, and industry experts to submit a speaking proposal to Jenkins World San Francisco and/or Europe. Submissions for both locations are being accepted now and will close on March 18, 2018 at 11:59 PM Pacific.
Where do I go to submit my proposal?
Submissions for both Jenkins World USA and Europe are accepted at:
Can I make proposal(s) to both conferences?
Yes, you can! Once you’ve created an account on the CFP website you will be given the option to make submission(s) to one conference or both conferences.
Important Dates:
- CFP Opens: January 17, 2018
- CFP Closes: March 18, 2018 @ 11:59pm Pacific
- CFP Notifications: April
- Agenda Announcement: April
- Event Dates:
  - Jenkins World USA | San Francisco | September 16 - 19, 2018
  - Jenkins World Europe | October - exact date TBA
How to Get the Most Out of Your Eero Mesh Wi-Fi System
Geek Trivia: The Highest Grossing World War II Film Of All Time Is?
Wednesday, 17 January 2018
What Is UserEventAgent, and Why Is It Running on My Mac?
Google's Cloud AutoML makes it easier for businesses to build custom machine learning models
How to Report YouTube Videos and Comments
How to Hide Your Active Status on Facebook Messenger
Never Explain, never Protest.
How to Stop the Meltdown and Spectre Patches from Slowing Down Your PC
Karnataka woos techies, organises blockchain hackathon
Meltdown and Spectre’s Performance Impact on Big Data Workloads in the Cloud
Last week, the details of two industry-wide security vulnerabilities, known as Meltdown and Spectre, were released. These exploits enable cross-VM and cross-process attacks by allowing untrusted programs to scan other programs’ memory.
On Databricks, the only place where users can execute arbitrary code is in the virtual machines that run Apache Spark clusters. There, cross-customer isolation is handled at the VM level. The cloud providers that Databricks runs on, Azure and AWS, have both announced that they have patched their hypervisors to prevent cross-VM attacks. Databricks depends on our cloud providers’ hypervisors to provide security isolation across VMs and the applied hypervisor updates should be sufficient to protect against the demonstrated cross-tenant attack.
Aside from the security impacts, our users probably care the most about the degree of performance degradation introduced by the mitigation strategies. In our nightly performance benchmarks, we noticed some changes on January 3, when the exploits were disclosed. Our preliminary assessment is that we have observed a small degradation in AWS of 3% in most instances and up to 5% in a particular case, from the hypervisor updates. We have not included a preliminary assessment on Azure because we do not have the historical data that we would need to feel confident regarding such an assessment.
A snapshot of Databricks’ internal AWS performance dashboard (Y-axis is the geometric mean of runtime for some collection of workloads; different series indicate different configurations)
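For readers unfamiliar with the dashboard's Y-axis metric: a geometric mean aggregates many query runtimes into one number without letting a single long-running query dominate. A minimal sketch, with made-up runtimes:

```python
import math

def geometric_mean(runtimes):
    """Geometric mean: the n-th root of the product of n runtimes.

    Computed in log space for numerical stability. Unlike the
    arithmetic mean, it is not dominated by one slow query, which is
    why benchmark suites often use it to summarize many runtimes.
    """
    return math.exp(sum(math.log(t) for t in runtimes) / len(runtimes))

# Made-up runtimes (seconds) for illustration only.
print(geometric_mean([2.0, 8.0]))  # 4.0, i.e. sqrt(2 * 8)
```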
In this blog post, we analyze the potential performance impacts caused by the hypervisor mitigations for Meltdown and Spectre, using our nightly performance benchmarks. In a subsequent blog post, we also provide an overview of the exploits and mitigation strategies available.
Introduction
Big data applications are among the most resource-demanding cloud computing workloads. For some of our customers, a small slowdown in their systems could mean millions of dollars in increased IT budget. With these exploits, the immediate question after the concern for security is this: how much will big data workloads slow down after the hypervisor mitigation patches are applied by cloud providers?
As mentioned earlier, we noticed a small spike after January 3rd on our internal performance dashboard. Spikes are occasionally caused by sources of random variability, so we had to wait until we had more data in order to draw firmer conclusions. Now that we have more data points, it has become more apparent that the hypervisor updates have performance implications on big data workloads.
The mitigation strategies for Meltdown and Spectre impact code paths that perform virtual function calls and context switches (such as thread switches, system calls, disk I/O, and network I/O interrupts).
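The extra cost shows up at kernel crossings, which is why syscall-heavy code paths pay the most. A toy Python micro-benchmark (this is not the Databricks benchmark suite, and absolute timings vary by machine and patch level) makes the syscall vs. user-space gap visible:

```python
import os
import time

def time_per_call(fn, n=100_000):
    """Rough wall-clock cost of one call to fn, in nanoseconds."""
    start = time.perf_counter()
    for _ in range(n):
        fn()
    return (time.perf_counter() - start) / n * 1e9

# A real system call (stat crosses into the kernel on every call)
# vs. a pure user-space operation of similar triviality.
syscall_ns = time_per_call(lambda: os.stat("."))
userspace_ns = time_per_call(lambda: abs(-1))

print(f"stat(): ~{syscall_ns:.0f} ns/call, abs(): ~{userspace_ns:.0f} ns/call")
```

Mitigations like kernel page-table isolation add a roughly fixed cost to each of those kernel crossings, so workloads that make many of them (heavy disk or network I/O, frequent context switches) see a larger relative slowdown.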
To the best of our knowledge, no reports exist for workloads on big data systems running in the cloud, which typically exercise very different code paths from desktop or server applications. While there have been plenty of reports ranging from “negligible impact on performance” to 63% slowdown in FS-Mark, the closest example we have seen is the 7% to 23% degradation of Postgres, and some of our customers worry they will observe a similar performance hit for Apache Spark.
However, even though both are data systems, Spark’s data plane (executor) looks nothing like Postgres. One of the major goals of Project Tungsten in Spark 2.0 was to eliminate as many virtual function dispatches as possible through code generation. This has the fortunate side effect of reducing the impact of Meltdown and Spectre’s current mitigations on Spark’s execution.
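The dispatch-elimination idea can be sketched with a plain-Python analogy (illustrative only, not Spark's actual Tungsten generator, which emits Java bytecode): walking an expression tree costs one dynamic dispatch per node per row, while generating a single fused function up front removes that per-row overhead.

```python
# Toy analogy for whole-expression code generation: evaluate a + b * 2
# either by walking an expression tree (interpreted, one dynamic
# dispatch per node per row) or by compiling the tree into one
# fused function once, up front.

class Col:
    def __init__(self, name): self.name = name
    def eval(self, row): return row[self.name]
    def gen(self): return f"row[{self.name!r}]"

class Lit:
    def __init__(self, v): self.v = v
    def eval(self, row): return self.v
    def gen(self): return repr(self.v)

class Add:
    def __init__(self, l, r): self.l, self.r = l, r
    def eval(self, row): return self.l.eval(row) + self.r.eval(row)
    def gen(self): return f"({self.l.gen()} + {self.r.gen()})"

class Mul:
    def __init__(self, l, r): self.l, self.r = l, r
    def eval(self, row): return self.l.eval(row) * self.r.eval(row)
    def gen(self): return f"({self.l.gen()} * {self.r.gen()})"

expr = Add(Col("a"), Mul(Col("b"), Lit(2)))

# "Code generation": collapse the tree into one function, removing
# the per-row, per-node dispatch that the interpreted path pays.
compiled = eval(f"lambda row: {expr.gen()}")

row = {"a": 3, "b": 4}
assert expr.eval(row) == compiled(row) == 11
```

Fewer indirect calls per row means fewer of the branch-prediction and speculation sites that the Spectre mitigations penalize.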
Spark’s control plane (driver) triggers operations that can be impacted more by the mitigations. For example, network RPCs trigger context switches, and the control plane code tends to perform more virtual function dispatches (due to the use of object-oriented programming and functions becoming megamorphic). The good news is that the driver is only responsible for scheduling and coordination, and has in general low CPU utilization. Consequently, we expect less performance degradation in Spark than transactional database systems like Postgres.
Methodology
Before presenting the benchmark results, we’ll first share how we conducted this analysis which leverages our nightly performance benchmarks.
Workloads: Our nightly benchmarks consist of hundreds of queries, including all 99 TPC-DS queries, running on two of the most popular instance types on AWS for big data workloads (r3 and i3). These queries cover different Spark use cases, ranging from interactive queries that scan small amounts of data to deep analytics queries that scan large amounts. In aggregate, they represent big data workloads in the cloud.
Going back in time: Part of the reason public benchmarks measuring the impact of these hypervisor fixes are so sparse is that the cloud providers applied their hypervisor changes soon after the exploits were disclosed, and there is now no way to go back in time and perform controlled experiments on unpatched machines. The most scientific measurement would require running the same set of workloads against both an unpatched and a patched datacenter. Absent that, we leverage our nightly performance benchmark to analyze the degradation.
Noise: An additional challenge is that the cloud is inherently noisy, as are any shared resources and distributed systems. Network or storage performance might vary from time to time due to varying resource utilization. The following chart shows the runtime of a benchmark configuration last year, before any hypervisor fixes. As shown in the graph, even without any known major updates, we can see substantial variations in performance from run to run.
Variance and spikes exist in the cloud, even without security patches
As a result, we need a sufficient number of runs to remove the impact of noise. Therefore we cannot use a single run, or even two runs, to conclude the performance of a configuration.
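Why multiple runs help can be made concrete: averaging n noisy nightly runs shrinks the noise in the estimate by roughly a factor of sqrt(n). A toy sketch with simulated, made-up runtimes:

```python
import random
import statistics

random.seed(0)  # reproducible illustration; all numbers are made up

# Simulated nightly benchmark runtimes (minutes): identical "true"
# performance every night, plus Gaussian cloud noise.
true_runtime = 100.0
runs = [random.gauss(true_runtime, 5.0) for _ in range(7)]

stdev = statistics.stdev(runs)
# The noise remaining in the 7-day average shrinks by sqrt(7).
stderr_of_mean = stdev / len(runs) ** 0.5

print(f"run-to-run stdev: {stdev:.2f} min, "
      f"std. error of 7-day mean: {stderr_of_mean:.2f} min")
```

With run-to-run noise of several percent, a single night's result cannot distinguish a 3% regression from noise, but a week of averaged runs can.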
Fortunately, we do have performance benchmarks that are run nightly, and we have accumulated 7 days of data before and after January 3 (the date it appears AWS applied the patch to our systems). For each day, hundreds of queries are run in multiple cluster configurations and versions, so in aggregate we have tens of thousands of query runs.
Isolating effect of software changes: While these benchmarks are used primarily to track performance regressions for the purpose of software development, we also run old versions of Spark to establish performance baselines. Using these old versions, we can isolate the effect of software improvements.
We also have additional release smoke tests that exercise a more comprehensive matrix of configurations and instance types, but they are not run as frequently. We do not report numbers for those tests because we feel we do not have enough data points to establish strong conclusions.
Degradation on Amazon Web Services
On AWS, we have observed a small performance degradation up to 5% since January 4th. On i3-series instance types, where we cache data on the local NVMe SSDs (Databricks Cache), we have observed a degradation up to 5%. On r3-series instance types, in which the benchmark jobs read data exclusively from remote storage (S3), we have observed a smaller increase of up to 3%. The greater percentage slowdown for the i3 instance type is explained by the larger number of syscalls performed when reading from the local SSD cache.
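The percentages above boil down to comparing average runtimes before and after the patch date. A minimal sketch, using made-up numbers rather than the actual benchmark data:

```python
def degradation_pct(before_runs, after_runs):
    """Percent slowdown of the mean runtime after vs. before a change."""
    before = sum(before_runs) / len(before_runs)
    after = sum(after_runs) / len(after_runs)
    return (after - before) / before * 100

# Illustrative daily benchmark runtimes (minutes), not real data.
before = [100.0, 102.0, 98.0]
after = [103.0, 104.0, 102.0]
print(f"degradation: {degradation_pct(before, after):.1f}%")  # 3.0%
```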
The chart below shows results before and after January 3rd in AWS for an r3-series (memory optimized) and an i3-series (storage optimized) cluster. Both tests were fixed to the same runtime version and cluster size. The data represents the average of the full benchmark’s runtime per day, for 7 days prior to January 3 (blue) and 7 days after January 3 (red). We exclude January 3rd itself to avoid partial results. As mentioned, the i3-series has the Databricks Cache enabled on the local SSDs, resulting in roughly half the total execution time (faster) compared to the r3-series results.
Runtimes of each instance type are normalized to the runtime on the original r3 series configuration (first) reflecting data collected both 7 days before and 7 days after January 3rd.
Given the above data, one might wonder whether the difference in degradation between the i3-series and the r3-series was due to the different instance configurations (and architecture generation) or due to the Databricks Cache. Since we cannot go back in time to repeat the tests without the patch, to isolate the effect of the Databricks Cache we compared the performance of the first run of the first query on both r3 and i3, and found that the two performed comparably. That is to say, the additional degradation was due to caching: with NVMe SSD caching, CPU utilization is much higher, and Spark processes much more data per second and triggers much more I/O than without the cache. Nevertheless, the newer, storage-optimized i3-series with caching enabled completes the same benchmark in about half the time.
What’s Next?
While it’s difficult to be conclusive given the lack of controlled experiments, we have observed a small performance degradation (2 to 5%) from the hypervisor updates on AWS. We expect this impact to decrease as the patch implementations improve over time.
We, however, are not done yet. The whole incident is still unfolding. The hypervisor updates applied by cloud vendors only mitigate cross-VM attacks. The industry as a whole is also working on mitigation strategies at the kernel and process level to prevent user applications from scanning memory they should not be allowed to.
Our cloud vendors’ rapid response to this problem reaffirms Databricks’ core tenet that moving data processing to the cloud enables security issues to be rapidly detected, mitigated, and fixed.
Even though our security architecture does not depend on kernel- or process-level isolation, out of an abundance of caution we are also taking prompt steps to patch the vulnerabilities at these levels. We are performing controlled experiments on their performance impact, but we do not yet feel we have collected sufficient data to report. We will post updates as soon as possible.
--
Try Databricks for free. Get started today.
The post Meltdown and Spectre’s Performance Impact on Big Data Workloads in the Cloud appeared first on Databricks.
How to Buy Stuff at the Apple Store Without a Cashier
Geek Trivia: In English Culture, “Bangers” Are?
Tuesday, 16 January 2018
A Non-Comprehensive List of Messenger Features Facebook Could Cut
According to Facebook’s vice president of messaging product David Marcus, the Facebook Messenger app is “too cluttered.” We agree.
Alexa, Why Is Cortana Still on My Computer?
Modern Approaches to Creating Websites.
How Exactly to Write a Blues Song: Writing Lyrics and Audio
How to Use Microsoft Word’s Compare Feature
How to Remap an Xbox One Controller’s Buttons on Windows 10
Can You Use Any Charger With Any Device?
Geek Trivia: Which Soda Company Briefly Owned A Sizable Naval Fleet?
6 Cheap Alternatives to Adobe Photoshop
Adobe Photoshop is easily the industry standard when it comes to graphic and photo ed…
Monday, 15 January 2018
Quitting a COBRA and Job Eligibility
How Much Data Does Netflix Use?
IoT-enabled smart meter helps hourly water tracking at apartments
Drone companies in India call for changes to government’s draft regulations
Geek Trivia: The Direction Of Clock Hand Movement Was Determined By?
Sunday, 14 January 2018
How to Pick the Right Monitor for Your PC
JEP-200: Remoting / XStream whitelist integrated into Jenkins core
Overview
JEP-200 has been integrated into Jenkins weekly builds and (if all goes well) will be a part of the next LTS line. In a nutshell, this change is a security hardening measure to be less permissive about deserializing Java classes defined in the Java Platform or libraries bundled with Jenkins. For several years now, Jenkins has specifically blacklisted certain classes and packages according to known or suspected exploits; now it will reject all classes not explicitly mentioned in a whitelist, or defined in Jenkins core or plugins.
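The shift from blacklist to whitelist can be sketched as pseudologic in Python (the class names below are illustrative placeholders, not Jenkins’ actual lists or implementation):

```python
# Old approach: reject only classes on a known-bad blacklist.
BLACKLIST = {"bad.pkg.GadgetClass"}

# JEP-200 approach: accept only classes on an explicit whitelist
# (plus classes defined in Jenkins core or plugins, elided here).
WHITELIST = {"java.lang.String", "java.util.ArrayList"}

def blacklist_allows(class_name: str) -> bool:
    # Anything not explicitly banned is deserialized.
    return class_name not in BLACKLIST

def whitelist_allows(class_name: str) -> bool:
    # Anything not explicitly approved is rejected.
    return class_name in WHITELIST
```

An unknown class thus slips past the old blacklist check but is rejected by the new whitelist check, which is exactly the hardening JEP-200 provides.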
For Jenkins administrators
Before upgrade
Back up your Jenkins instance prior to upgrade so you have an easy way of rolling back. If you are running any of the plugins listed in Plugins affected by fix for JEP-200, update them after taking the backup but before upgrading Jenkins core.
If you have a way of testing the upgrade in an isolated environment before applying it to production, do so now.
Using backups and a staging server is good advice before any upgrade, but especially this one, which carries a relatively high risk of regression.
After upgrade
To the extent that advance testing of the impact of this change on popular plugins has been completed, most users (and even plugin developers) should not notice any difference. If you do encounter a java.lang.SecurityException: Rejected: some.pkg.and.ClassName in the Jenkins UI or logs, you may have found a case where an unusual plugin, or an unusual usage mode of a common plugin, violates the existing whitelist. This will be visible in the Jenkins system log as a message from jenkins.security.ClassFilterImpl like the following:
some.pkg.and.ClassName in file:/var/lib/jenkins/plugins/some-plugin-name/WEB-INF/lib/some-library-1.2.jar might be dangerous, so rejecting; see http://ift.tt/2mw8Zuj
where the link would direct you here.
If you find such a case, please report it in the Jenkins issue tracker, under the appropriate plugin component. Link it to JENKINS-47736 and add the JEP-200 label. If at all possible, include complete steps to reproduce the problem from scratch. Jenkins developers will strive to evaluate the reason for the violation and offer a fix in the form of a core and/or plugin update. For more details and current status, see Plugins affected by fix for JEP-200.
Assuming you see no particular reason to think that the class in question has dangerous deserialization semantics, which is rare, it is possible to work around the problem in your own installation as a temporary expedient. Simply make note of any class name(s) mentioned in such log messages, and run Jenkins with this startup option (details will depend on your installation method):
-Dhudson.remoting.ClassFilter=some.pkg.and.ClassName,some.pkg.and.OtherClassName
For plugin developers
Testing plugins against Jenkins 2.102 and above
As a plugin developer encountering this kind of error, your first task is to ensure that it is reproducible in a functional (JenkinsRule) test running against Jenkins 2.102 or newer:
mvn test -Djenkins.version=2.102
The above assumes you are using a recent 2.x or 3.x parent Plugin POM. For certain cases you may need to use Plugin Compat Tester (PCT) to run tests against Jenkins core versions newer than your baseline.
Running PCT against the latest Jenkins core:
java -jar pct-cli.jar -reportFile $(pwd)/out/pct-report.xml \
-workDirectory $(pwd)/work -skipTestCache true -mvn $(which mvn) \
-includePlugins ${ARTIFACT_ID} -localCheckoutDir ${YOUR_PLUGIN_REPO}
You may need to run tests using an agent (e.g., JenkinsRule.createSlave) or force saves of plugin settings.
For Maven plugins, you can also specify custom Jenkins versions in your Jenkinsfile to run tests against JEP-200:
buildPlugin(jenkinsVersions: [null, '2.102'])
(again picking whatever version you need to test against) so that the test is included during CI builds, even while your minimum core baseline predates JEP-200.
If your plugins are built with Gradle, your mileage may vary.
Making plugins compatible with Jenkins 2.102 or above
If you discover a compatibility issue in your plugin, you then have several choices for fixing the problem:
- Ideally, simplify your code so that the mentioned class is not deserialized via Jenkins Remoting or XStream to begin with:
  - If the problem occurred when receiving a response from an agent, change your Callable (or FileCallable) to return a plainer type.
  - If the problem occurred when saving an XML file (such as a config.xml or build.xml), use a plainer type in non-transient fields in your persistable plugin classes.
- If the class(es) are defined in the Java Platform or some library bundled in Jenkins core, propose a pull request adding it to core/src/main/resources/jenkins/security/whitelisted-classes.txt in jenkinsci/jenkins.
- If the class(es) are defined in a third-party library bundled in your plugin, create a resource file META-INF/hudson.remoting.ClassFilter listing them. (example)
  - You may also do this for Java or Jenkins core library classes, as a hotfix until your core baseline includes the whitelist entry proposed above.
- If the class(es) are defined in a JAR you build and then bundle in your plugin’s *.jpi, add a Jenkins-ClassFilter-Whitelisted: true manifest entry. This whitelists every class in the JAR. (example)
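For reference, a META-INF/hudson.remoting.ClassFilter resource is simply a plain-text file listing one fully-qualified class name per line; a sketch with placeholder names (substitute the class names from your own rejection messages):

```
some.pkg.and.ClassName
some.pkg.and.OtherClassName
```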
Geek Trivia: Which Of These Bands Derives Its Name From Military Slang?
Saturday, 13 January 2018
The Best (Actually Useful) Tech We Saw at CES 2018
Tech innovations and gadgets do not mean progress
Geek Trivia: Best Known For Jeans And Fashion, Which Of These Companies Now Has Significant Real Estate Holdings?
How to Produce an Outline
Friday, 12 January 2018
What is mDNSResponder, And Why Is It Running On My Mac?
How to Update Your Graphics Drivers for Maximum Gaming Performance
Five Fantastic Cable Organizers to Wrangle Your Messy Cables
Your desk and nightstand might be a mess of cables right now, but they don’t need to be.
How to Find and Delete Google Assistant’s Stored Voice Data
How To Quote Someone In WhatsApp
How to Overclock Your Graphics Card for Better Gaming Performance
Written Communication On the Job
NR Narayana Murthy & Kip Thorne on Interstellar experience, Elon Musk vs Jeff Bezos & encouraging pure sciences
iOS 11.2.2 Probably Won’t Slow Down Your iPhone That Badly
Geek Trivia: In Japan, Which Of These Things Are Deep Fried As An Autumn Treat?
Top Choices of Write an Argumentative Essay
EssayWriter at a Glance
Thursday, 11 January 2018
What to Do If You Forget Your Mac’s Password
Byte Size Tips: How to See Which Groups Your Windows User Account Belongs To
6 Fun Educational Toys and Apps To Teach Your Kids Coding
We live in a digital age and whether your child grows up to be an actual programmer or pursues another path, the structure and logic of programm…
How to Use Rulers in Microsoft Word
How to Produce Successful LinkedIn Recommendations
What To Do If Your Smartphone Is Hot
Five Ways to Free Up Space on Your Android Device
From smart poles, you can charge electric vehicles
RBI says not connected with study on Aadhaar security aspects
Is Apple Even Paying Attention To macOS Security Anymore?
2017 Year In Review
A new year is upon us, bringing refreshed optimism and new resolutions for many looking to the 12 months ahead. I’ve spent the holiday break reflecting on our past year and wanted to take a moment to share with you the tremendous confidence I have for our company and what we can achieve in 2018. […]
The post 2017 Year In Review appeared first on Hortonworks.
Geek Trivia: The “COCOM Limit” Is A GPS Security Regulation That Affects Recreational?
Software Programs to Assist Written Down College Reports
Wednesday, 10 January 2018
Mount a Windows Share in macOS and Have it Reconnect at Login
Databricks Runtime’s New DBIO Cache Boosts Apache Spark Performance
We are excited to announce the general availability of DBIO caching, a Databricks Runtime feature, part of the Unified Analytics Platform, that can improve the scan speed of your Apache Spark workloads by up to 10x, without any application code changes.
In this blog, we introduce the two primary focuses of this new feature: ease-of-use and performance.
- Unlike Spark’s explicit in-memory cache, this Databricks Runtime cache automatically caches hot input data for a user and load-balances it across the cluster.
- It leverages advances in NVMe SSD hardware together with state-of-the-art columnar compression techniques, and can improve the performance of interactive and reporting workloads by up to 10 times. What’s more, it can cache 30 times more data than Spark’s in-memory cache.
Explicit Caching in Apache Spark
One of the key features in Spark is its explicit in-memory cache. It is a versatile tool, as it can be used to store the results of an arbitrary computation (including inputs and intermediate results), so that they can be reused multiple times. For example, the implementation of an iterative machine learning algorithm may choose to cache the featurized data and each iteration may then read the data from memory.
A particularly important and widespread use case is caching the results of scan operations. This allows users to avoid the low throughput associated with reading remote data. For this reason, many users who intend to run the same or similar workload repeatedly decide to invest extra development time into manually optimizing their application, by instructing Spark exactly which files to cache and when to do it, hence “explicit caching.”
For all its utility, Spark cache also has a number of shortcomings. First, when the data is cached in the main memory, it takes up space that could be better used for other purposes during query execution, for example, for shuffles or hash tables. Second, when the data is cached on the disk, it has to be deserialized when read — a process that is too slow to adequately utilize the high read bandwidths commonly offered by the NVMe SSDs. As a result, occasionally Spark applications actually find their performance regressing when turning on Spark caching.
Third, having to plan ahead and explicitly declare which data should be cached is challenging for users who want to interactively explore data or build reports. While Spark cache gives data engineers all the knobs to tune, data scientists often find it difficult to reason about the cache, especially in a multi-tenant setting, where they still need results returned as quickly as possible to keep iteration time short.
The Challenge with NVMe SSDs
Solid state drives, or SSDs, have become the standard storage technology. While initially known for their low random-seek latency, SSDs have also substantially increased their read and write throughput over the past few years.
The NVMe interface was created to overcome the design limitations of SATA and AHCI, and to allow unrestrained access to the excellent performance provided by modern SSDs. This includes the ability to utilize the internal parallelism and the extremely low read latency of flash-based storage devices. NVMe’s use of multiple long command queues, as well as other enhancements, allows the drives to efficiently handle a huge number of concurrent requests. This parallelism-oriented architecture perfectly complements the parallelism of modern multi-core CPUs and data processing systems like Spark.
With the NVMe interface, the SSDs are much closer in their properties and performance to the main memory than to the slow magnetic drives. As such, they are a perfect place to store the cached data.
Yet in order to fully leverage the potential of NVMe SSDs, it is not enough to simply copy the remote data into the local storage. Our experiments with AWS i3 instances showed that while reading commonly used file formats from local SSDs, it’s only possible to utilize a fraction of available I/O bandwidth.
The above graph shows the I/O bandwidth utilization for Spark against the local NVMe SSDs on EC2 i3 instance types. As shown, none of the existing formats can saturate the I/O bandwidth. The CPU-intensive decoding is simply too slow to keep up with the fast SSDs!
“It Just Works”
When designing the DBIO cache, we focused not only on achieving optimal read performance, but also on creating a solution which “just works,” with no added effort from the user required. The cache takes care of:
- Choosing which data to cache – whenever a remote file is accessed, the transcoded copy of the data is immediately placed in the cache
- Evicting long unused data – the cache automatically drops the least recently used entries when its allotted disk space is running out
- Load balancing – the cached data is distributed evenly across all the nodes in the cluster, and the placement is adjusted in case of auto-scaling and/or uneven utilization of the different nodes
- Data security – the data in the cache remains encrypted in the same way as other temporary files, e.g., shuffle files
- Data updates – the cache automatically detects when a file is added or deleted in remote location, and presents the up-to-date state of the data
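The eviction behavior described above can be sketched with a toy least-recently-used (LRU) structure in Python. This is purely illustrative, not the DBIO implementation; the byte budget and file names are made up:

```python
from collections import OrderedDict

class LRUFileCache:
    """Toy LRU cache keyed by file path, bounded by total cached bytes."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used = 0
        self.entries = OrderedDict()  # path -> size in bytes

    def access(self, path, size):
        if path in self.entries:
            self.entries.move_to_end(path)  # mark as most recently used
            return "hit"
        # Evict least-recently-used entries until the new file fits.
        while self.used + size > self.max_bytes and self.entries:
            _, evicted_size = self.entries.popitem(last=False)
            self.used -= evicted_size
        self.entries[path] = size
        self.used += size
        return "miss"

cache = LRUFileCache(max_bytes=100)
cache.access("a.parquet", 60)  # miss: cached
cache.access("b.parquet", 30)  # miss: cached
cache.access("a.parquet", 60)  # hit: "a" becomes most recently used
cache.access("c.parquet", 40)  # miss: evicts "b" (the LRU entry), keeps "a"
```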
Since Databricks Runtime 3.3, the DBIO cache has been pre-configured and enabled by default on all clusters with AWS i3 instance types. Thanks to the high write throughput of these instance types, data can be transcoded and placed in the cache without slowing down the queries performing the initial remote read. Users who prefer another type of worker node can enable caching using Spark configs (see the DBIO caching documentation page for more details).
For clients who would rather explicitly pre-cache all the necessary data ahead of time, we implemented the CACHE SELECT command. It eagerly loads the chosen portion of the data into the DBIO cache. Users can specify a vertical (i.e., selected columns) and a horizontal (i.e., rows required to evaluate a given predicate) slice of the data to be cached.
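For example, a CACHE SELECT statement can name both the columns and a predicate, so only that slice of the table is loaded into the cache (the table, columns, and predicate below are placeholders, not from a real schema):

```sql
CACHE SELECT ss_item_sk, ss_sales_price
FROM store_sales
WHERE ss_sold_date_sk >= 2451911
```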
Performance
To leverage NVMe SSDs, rather than directly caching the “raw bytes” of the input, this new feature automatically transcodes data into a new ephemeral, highly optimized on-disk caching format, which offers superior decoding speed and thus better I/O bandwidth utilization. The transcoding is performed asynchronously to minimize the overhead for queries that load data into the cache.
The enhanced reading performance (on top of the ability to avoid high latency normally associated with access to remote data) results in a substantial speed-up in a wide variety of queries. For example, for the following subset of TPC-DS queries, we see consistent improvement in every single query when compared to reading Parquet data stored in AWS S3, with as much as 5.7x speed-up in query 53.
In some customer workloads from our private beta program, we’ve seen performance improvements of up to 10x!
Combining Spark Cache and the DBIO Cache
Both Spark cache and the DBIO cache can be used alongside each other without an issue. In fact, they complement each other rather well: Spark cache provides the ability to store the results of arbitrary intermediate computation, whereas the DBIO cache provides automatic, superior performance on input data.
In our experiments, the DBIO cache achieves 4x faster reading speed than the Spark cache in DISK_ONLY mode. When compared with MEMORY_ONLY mode, the DBIO cache still provides 3x speed-up, while at the same time managing to keep a small memory footprint.
DBIO Cache Configuration
On AWS i3 instance types running Databricks Runtime 3.3+, the cache is enabled by default for all Parquet files, and it also works seamlessly with Databricks Delta.
To use the new cache for other Azure or AWS instance types, set the following configuration parameters in your cluster configuration:
spark.databricks.io.cache.enabled true
spark.databricks.io.cache.maxDiskUsage "{DISK SPACE PER NODE RESERVED FOR CACHED DATA}"
spark.databricks.io.cache.maxMetaDataCache "{DISK SPACE PER NODE RESERVED FOR CACHED METADATA}"
Conclusion
The DBIO cache provides substantial benefits to Databricks users, both in ease of use and query performance. It can be combined with the Spark cache in a mix-and-match fashion, using the best tool for the task at hand. With the upcoming performance enhancements and support for additional file formats, the DBIO cache should become a staple for a wide variety of workloads.
To try this new feature, choose an i3 instance type cluster in our Unified Analytics Platform today.
--
Try Databricks for free. Get started today.
The post Databricks Runtime’s New DBIO Cache Boosts Apache Spark Performance appeared first on Databricks.